1  Summary  of  program  objectives  and  outcomes


The goal of this work was to develop sequential prediction methods for online information fusion and
control, with methods designed to handle unknown environmental dynamics (potentially stemming
from an adversary who reacts to sensing actions), active sensing paradigms, and external feedback
mechanisms. Online prediction and targeted collection of information is an emerging paradigm at the
intersection of optimization, machine learning, and control theory. It is concerned with real-time
sequential planning of actions or decisions in the presence of model uncertainty, nonstationarity,
and possibly adversarial disturbances.

Our team has developed several methods, together with supporting theory, that meet these objectives:

•  Early in the course of the program, we developed methods for online learning that use external
feedback to efficiently guide active data collection and the utilization of expensive expert system
resources. In particular, we developed an online convex programming approach to sequential
probability assignment of high-dimensional co-occurrence data [25]. Our approach consists of two
main elements: (1) filtering, or assigning a belief or likelihood to each successive measurement
based upon our ability to predict it from previous noisy observations, and (2) hedging, or flagging
potential anomalies by comparing the current belief against a time-varying and data-adaptive
threshold. The threshold is adjusted based on the available feedback from an end user. Our
algorithms, which combine universal prediction with recent work on online convex programming, do
not require computing posterior distributions given all current observations and involve simple
primal-dual parameter updates. At the heart of our approach lie exponential-family models, which
can be used in a wide variety of contexts and applications and which yield performance comparable
to that of a batch algorithm with access to all data at once rather than sequentially. Specifically,
our methods achieve sublinear per-round regret against both static and slowly varying product
distributions with marginals drawn from the same exponential family. Moreover, the regret against
static distributions coincides with the minimax value of the corresponding online strongly convex
game. We also prove bounds on the number of mistakes made during the hedging step relative to the
best offline choice of the threshold with access to all estimated beliefs and feedback signals.
Furthermore, our approach is provably robust to unknown environmental dynamics and unmodeled
statistical dependencies. As described in [19], our computationally efficient sequential anomaly
detection using feedback can be used as an effective pre-processing step for large volumes of social
network data associated with encrypted or other contextual information which can only be analyzed
with resource-intensive expert systems.

•  Online optimization methods are often designed to have a total accumulated loss comparable to
that achievable by some comparator, such as a batch algorithm with access to all the data and
infinite computational resources. In many settings, this comparator is allowed to vary with time,
and the associated “tracking regret” bounds scale with the overall variation of the comparator
sequence. However, in practical scenarios ranging from motion imagery to network analysis, the
environment is nonstationary and comparator sequences with small variation are quite weak,
resulting in large losses. Our work describes a “dynamic mirror descent” method which addresses
this challenge, yielding low regret bounds for comparators with small deviations from a given
dynamical model. This approach is then used within a broader class of online learning methods to
simultaneously track the best dynamical model and form predictions based on that model. This
concept is demonstrated empirically in the context of sequential compressed sensing of a dynamic
scene, solar flare detection from satellite data with missing elements, and tracking a dynamic
social network [13, 14, 15].

•  We also considered an online (real-time) control problem that involves an agent performing a
discrete-time random walk over a finite state space. The agent’s action at each time step is to
specify the probability distribution for the next state given the current state. Following the set-up
of Todorov, the state-action cost at each time step is a sum of a state cost and a control cost given
by the Kullback-Leibler (KL) divergence between the agent’s next-state distribution and that
determined by some fixed passive dynamics. The online aspect of the problem is due to the fact that
the state cost functions are generated by a dynamic environment, and the agent learns the current
state cost only after selecting an action. An explicit construction of a computationally efficient
strategy with small regret (i.e., expected difference between its actual total cost and the smallest
cost attainable using noncausal knowledge of the state costs) under mild regularity conditions is
presented, along with a demonstration of the performance of the proposed strategy on a simulated
target tracking problem. A number of new results on Markov decision processes with KL control
cost are also obtained [10, 11].
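To make the dynamic mirror descent idea above concrete, here is a minimal sketch in the Euclidean case. The report does not spell out the update, so the form used here (a gradient step on the revealed loss followed by one application of the dynamical model) is an assumption, as are the squared loss, the rotation dynamics, the step size, and all function and variable names.

```python
import numpy as np

def dynamic_mirror_descent(observations, dynamics, step_size):
    """Euclidean sketch: gradient step on the revealed loss, then apply the model.

    Assumed loss (illustration only): l_t(theta) = 0.5 * ||theta - y_t||^2.
    """
    theta = np.zeros(observations.shape[1])
    total_loss = 0.0
    for y in observations:
        total_loss += 0.5 * np.sum((theta - y) ** 2)  # loss of the current prediction
        grad = theta - y                              # gradient of the assumed loss
        theta_tilde = theta - step_size * grad        # mirror (here: gradient) step
        theta = dynamics @ theta_tilde                # advance with the dynamical model
    return total_loss

# Hypothetical target: a unit vector rotating by a fixed angle at every step.
angle = 0.3
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
truth = [np.array([1.0, 0.0])]
for _ in range(99):
    truth.append(rotation @ truth[-1])
observations = np.array(truth)

loss_with_model = dynamic_mirror_descent(observations, rotation, step_size=0.5)
loss_without_model = dynamic_mirror_descent(observations, np.eye(2), step_size=0.5)
```

With the correct dynamics the prediction error halves at every step in this toy problem, whereas the identity model (plain online gradient descent) keeps lagging behind the rotating target. This is the sense in which a comparator that follows a dynamical model can be far stronger than a slowly varying one.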

Most recently, we have built upon and to some extent unified the above results to develop a general
procedure for developing low-regret algorithms for online Markov decision processes. This culminating
work is described in detail in this report. Online learning algorithms can deal with nonstationary
environments, but generally there is no notion of a dynamic state to model constraints on current and
future actions as a function of past actions. State-based models are common in stochastic control
settings, but commonly used frameworks such as Markov Decision Processes (MDPs) assume a known
stationary environment. In recent years, there has been a growing interest in combining the above two
frameworks and considering an MDP setting in which the cost function is allowed to change arbitrarily
after each time step. However, most of the work in this area has been algorithmic: given a problem, one
would develop an algorithm almost from scratch. Moreover, the presence of the state and the assumption
of an arbitrarily varying environment complicate both the theoretical analysis and the development of
computationally efficient methods. This report describes a broad extension of the ideas proposed by
Rakhlin, Shamir, and Sridharan [27] to give a general framework for deriving algorithms in an MDP
setting with arbitrarily changing costs. This framework leads to a unifying view of existing methods and
provides a general procedure for constructing new ones. One such new method is presented and shown to
have important advantages over a similar method developed outside the framework proposed in this report.

2  Relationship  between  program  outcomes  and  previous  state  of  the  art 

Markov decision processes, or MDPs for short [2, 17, 24], are a popular framework for sequential
decision-making in a dynamic environment. In an MDP, we have states and actions. At each time step of
the sequential decision-making process, the agent observes the current state and chooses an action, and
the system transitions to the next state according to a fixed and known Markov law. The costs incurred
by the agent depend both on his action and on the current state. The traditional theory of MDPs deals
with the case when both the transition law and the state-action cost function are known in advance. In
this case, the problem of policy design reduces to dynamic programming. However, a priori known costs
are typically unavailable in practical settings. In this report, instead of considering a fixed cost
function, we study Markov decision processes with finite state and action spaces, where the cost
functions are chosen arbitrarily and allowed to change with time. More specifically, we are interested
in the online MDP problem: just as in the usual online learning framework [7, 16, 28], the one-step cost
functions form an arbitrarily varying sequence, and the cost function corresponding to each time step is
revealed to the agent after an action has been taken. The objective of the agent is to minimize regret
relative to the best stationary Markov policy that could have been selected with full knowledge of the
cost function sequence over the horizon of interest. The assumption of arbitrary time-varying cost
functions makes sense in highly uncertain and complex environments whose temporal evolution may be
difficult or costly to model, and it also accounts for collective (and possibly irrational) behavior of
any other agents that may be present. The regret minimization viewpoint then ensures that the agent’s
online policy is robust against these effects.

Online MDP problems can be viewed as online control problems. The online aspect is due to the
fact that the cost functions are generated by a dynamic environment under no distributional
assumptions, and the agent learns the current state-action cost only after selecting an action. The
control aspect comes from the fact that the choice of an action at each time step influences future
states and costs. Taking into account the effect of past actions on future costs in a dynamic
distribution-free setting makes online MDPs hard to solve. To the best of our knowledge, only a few
methods have been developed in this area over the past decade [1, 3, 9, 12, 21, 23, 33]. Most research
in this area has been algorithmic: given a problem, one would present a method and prove a guarantee
(i.e., a regret bound) on its performance. Thus, it is desirable to provide a unifying view of existing
methods and a general procedure for constructing new ones. In this report, we present such a general
framework for online MDP problems, bringing two well-known existing methods under a single theoretical
interpretation. This general framework not only enables us to recover known algorithms, but it also
gives us a generic toolbox for deriving new algorithms from a more principled perspective rather than
from scratch.

The online MDP setting we are considering was first defined and studied in [9] and [33], which deal
with MDPs with arbitrarily varying rewards. Like these authors, we assume a full information feedback
model and known stochastic state transition dynamics. (However, it should be pointed out that these
assumptions have been relaxed in some recent works: for example, [23] and [3] assume only bandit-type
feedback, while [1] proves regret bounds for MDPs with arbitrarily varying transition models and cost
functions. An extension of our framework to these settings is an interesting avenue for future
research.)

Our general approach is motivated by recent work of [27], which gives a principled way of deriving
online learning algorithms (and bounding their regret) from a minimax analysis. Of course, many online
learning algorithms have been developed in various settings over the past few decades, but a
comprehensive and systematic treatment was still lacking. Starting from a general formulation of online
learning as a repeated game between a learner and an adversary, [27] analyze the minimax value of this
online learning game. It was known before [26] that one could derive sublinear non-constructive upper
bounds on the minimax value. However, algorithm design was done on a case-by-case basis, and a separate
analysis was needed in each case to derive performance guarantees matching these upper bounds. The work
of [27] bridges this gap between minimax value analysis and algorithm design: they have shown that, by
choosing appropriate relaxations of a certain recursive decomposition of the minimax value, one can
recover many known online learning algorithms and give a general recipe for developing new ones. In
short, the framework proposed by [27] can be used to convert an upper bound on the value of the game
into an algorithm.

Our main contribution is an extension of the framework of [27] to online MDPs. Since online learning
problems are studied in a state-free setting, it is not straightforward to generalize the ideas of [27]
to the case when the system has a state, and the arguments involved in online MDPs are technically much
heavier than their state-free counterparts. We formulate the online MDP problem as a two-player repeated
game and study its minimax value in the presence of state variables. We introduce the notion of an
online MDP relaxation and show how it can be used to recover existing methods and to construct new
algorithms. More specifically, we use Poisson inequalities for MDPs [22] to move from a state-dependent
setting to a state-free one. As a consequence, each possible state value is associated with an
individual online learning algorithm. We show that the algorithm proposed by [9] arises from a
particular relaxation, and we also derive a new algorithm in the spirit of [33] which exhibits improved
regret bounds.

The remainder of the report is organized as follows. We close this section with a brief summary of our
results and frequently used notation. Section 3 contains a precise formulation of the online MDP problem
and points out the general idea and major challenges. Section 4 describes our proposed framework and
contains the main result. Section 5 shows the power of our framework by recovering an existing method
proposed in [9] and by deriving a new algorithm. Section 6 discusses directions for future research.
Proofs of all intermediate results are relegated to the Appendix.

2.1  A  summary  of  results 

We  start  by  recasting  an  MDP  with  arbitrary  costs  as  a  one-sided  stochastic  game,  where  an  agent  who 
wishes  to  minimize  his  long-term  average  cost  is  facing  a  Markovian  environment,  which  is  also  affected 
by  arbitrary  actions  of  an  opponent.  A  stochastic  game  [31]  is  a  repeated  two-player  game,  where  the 
state  changes  at  every  time  step  according  to  a  transition  law  depending  on  the  current  state  and  the 
moves  of  both  players.  Here  we  are  considering  a  special  type  of  a  stochastic  game,  where  the  agent 
controls  the  state  transition  alone  and  the  opponent  chooses  the  cost  functions.  By  “one-sided”,  we  mean 
that  the  utility  of  the  opponent  is  left  unspecified.  In  other  words,  we  do  not  need  to  study  the  strategy 
and  objectives  of  the  opponent,  and  only  assume  that  the  changes  in  the  environment  in  response  to  the 
opponent’s  moves  occur  arbitrarily.  As  a  result,  we  simply  model  the  opponent  as  the  environment. 

A popular and common objective in such settings is regret minimization. Regret is defined as the
difference between the cost the agent actually incurred and the cost that could have been incurred if
the agent had known the observed sequence of cost functions in advance. We will give the precise
definition of this regret notion in Section 3. We start by studying the minimax regret, i.e., the regret
the agent will suffer when both the agent and the environment play optimally. By applying the theory of
dynamic programming for stochastic games [31], we can give the minimax strategy for the agent that
achieves minimax regret. It can be interpreted as choosing the best action that takes into account the
current cost and the worst-case future. Unfortunately, this minimax strategy is in general not
computationally feasible, because the number of possible futures grows exponentially with time. The idea
is to find a way to approximate the term that represents the “future” and to use this approximation to
derive a near-optimal strategy that is easy to compute.

Our  main  contribution  is  a  construction  of  a  general  procedure  for  deriving  algorithms  in  the  online 
MDP  setting.  More  specifically: 

1.  Just as in the state-free setting considered by [27], we argue that algorithms can be constructed
systematically by first deriving a sequence of upper bounds (relaxations) on a quantity called
sequential Rademacher complexity, and then plugging these upper bounds into a recursively defined
system of inequalities (called the admissibility conditions).

2.  Once a relaxation and an algorithm are derived in this way, we give a general regret bound of that
algorithm as follows:

Expected regret ≤ Relaxation + Stationarization error.

The first term on the right-hand side of the above inequality is related to the derived relaxation,
while the second term is an approximation error that results from approximating the Markovian
evolution of the underlying process by a simpler steady-state using a procedure we refer to as
stationarization. The first term can be analyzed using essentially the same techniques as the ones
employed by [27], with some modifications; by contrast, the second term can be handled using
only Markov chain methods. This approach significantly alleviates the technical burden of proving
a regret bound as in the literature before our work.

3.  Using the above procedure, we recover an existing method proposed in [9], which achieves
$O(\sqrt{T})$ expected regret against the best stationary policy. We show that our derived relaxation
gives us the same exponentially weighted average forecaster as in [9] and leads to the same regret
bound.

4.  We also derive a new algorithm using our proposed framework and argue that, while this new
algorithm is similar in nature to the work of [33], it has several advantages, in particular better
scaling of the regret with the horizon $T$.
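The exponentially weighted average forecaster mentioned in item 3 can be sketched in its basic state-free form as follows. This is only the generic expert-advice building block, not the full online MDP method of [9]; the cost range [0, 1], the step size, and all names are assumptions made for illustration.

```python
import numpy as np

def exponential_weights(cumulative_cost, eta):
    """Distribution over actions proportional to exp(-eta * cumulative cost)."""
    shifted = cumulative_cost - cumulative_cost.min()  # shift for numerical stability
    weights = np.exp(-eta * shifted)
    return weights / weights.sum()

def forecaster_regret(costs, eta):
    """Expected cost of the forecaster minus the cost of the best fixed action."""
    horizon, n_actions = costs.shape
    cumulative = np.zeros(n_actions)
    expected_cost = 0.0
    for t in range(horizon):
        p = exponential_weights(cumulative, eta)  # chosen before costs[t] is revealed
        expected_cost += p @ costs[t]
        cumulative += costs[t]
    return expected_cost - cumulative.min()

rng = np.random.default_rng(0)
horizon, n_actions = 200, 4
costs = rng.random((horizon, n_actions))          # hypothetical costs in [0, 1]
eta = np.sqrt(8.0 * np.log(n_actions) / horizon)  # standard tuning for costs in [0, 1]
regret = forecaster_regret(costs, eta)
bound = np.sqrt(horizon * np.log(n_actions) / 2.0)
```

For costs in [0, 1] and this choice of `eta`, the classical guarantee is that the regret never exceeds `bound`, which grows like the square root of the horizon, in line with the $O(\sqrt{T})$ scaling quoted above.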

2.2  Notation 

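We will denote the underlying finite state space and action space by $\mathsf{X}$ and $\mathsf{U}$,
respectively. The set of all probability distributions on $\mathsf{X}$ will be denoted by
$\mathcal{P}(\mathsf{X})$, and the same goes for $\mathsf{U}$ and $\mathcal{P}(\mathsf{U})$. A matrix
$P = [P(u|x)]_{x \in \mathsf{X}, u \in \mathsf{U}}$ with nonnegative entries, and with the rows and the
columns indexed by the elements of $\mathsf{X}$ and $\mathsf{U}$ respectively, is called Markov (or
stochastic) if its rows sum to one: $\sum_{u \in \mathsf{U}} P(u|x) = 1$, $\forall x \in \mathsf{X}$.
We will denote the set of all such Markov matrices (or randomized state feedback laws) by
$\mathcal{M}(\mathsf{U}|\mathsf{X})$. Markov matrices in $\mathcal{M}(\mathsf{U}|\mathsf{X})$ transform
probability distributions on $\mathsf{X}$ into probability distributions on $\mathsf{U}$: for any
$p \in \mathcal{P}(\mathsf{X})$ and any $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$, we have

$$ pP(u) = \sum_{x \in \mathsf{X}} p(x) P(u|x), \qquad \forall u \in \mathsf{U}. $$

The same applies to Markov matrices on $\mathsf{X}$ and to their action on the elements of
$\mathcal{P}(\mathsf{X})$.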
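As a small numerical illustration of how a Markov matrix maps a state distribution to an action distribution, the sketch below stores a feedback law as an array with one row per state and one column per action. The array layout, the numbers, and the helper name `is_markov` are assumptions made for this example.

```python
import numpy as np

def is_markov(P, tol=1e-12):
    """Check nonnegative entries and unit row sums, i.e. sum_u P(u|x) = 1 for every x."""
    return bool(np.all(P >= 0) and np.allclose(P.sum(axis=1), 1.0, atol=tol))

# Hypothetical feedback law with 3 states (rows) and 2 actions (columns).
P = np.array([[0.9, 0.1],
              [0.5, 0.5],
              [0.2, 0.8]])
p = np.array([0.2, 0.3, 0.5])  # a distribution on the state space

pP = p @ P  # pP(u) = sum_x p(x) P(u|x), here [0.43, 0.57]
```

The result is again a probability vector, now indexed by actions.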

The fixed and known stochastic transition kernel of the MDP will be denoted throughout by $K$; that
is, $K(y|x,u)$ is the probability that the next state is $y$ if the current state is $x$ and the action
$u$ is taken. For any Markov matrix (randomized state feedback law)
$P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$, we will denote by $K(y|x,P)$ the Markov kernel

$$ K(y|x,P) = \sum_{u \in \mathsf{U}} K(y|x,u) P(u|x). $$

Similarly, for any $\nu \in \mathcal{P}(\mathsf{U})$,

$$ K(y|x,\nu) = \sum_{u \in \mathsf{U}} K(y|x,u) \nu(u) $$

(this can be viewed as a special case of the previous definition if we interpret $\nu$ as a state
feedback law that ignores the state and draws a random action according to $\nu$). For any
$p \in \mathcal{P}(\mathsf{X})$ and $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$, $p \otimes P$ denotes
the induced joint state-action distribution on $\mathsf{X} \times \mathsf{U}$:

$$ p \otimes P(x,u) = p(x) P(u|x), \qquad \forall (x,u) \in \mathsf{X} \times \mathsf{U}. $$

We say that $P$ is unichain [18] if the corresponding Markov chain with transition kernel
$K(\cdot|\cdot,P)$ has a single recurrent class of states (plus a possibly empty transient class). This
is equivalent to the induced kernel $K(\cdot|\cdot,P)$ having a unique invariant distribution $\pi_P$
[29].
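The induced kernel, its invariant distribution, and the joint state-action distribution can be computed directly for small examples. The sketch below assumes the kernel is stored as an array `K[x, u, y]` and the feedback law as `P[x, u]`; the layout, the least-squares solve, and the toy numbers are illustrative choices, and the solve returns the invariant distribution only when it is unique (the unichain case).

```python
import numpy as np

def induced_kernel(K, P):
    """K_P[x, y] = sum_u K(y|x,u) P(u|x), with K[x, u, y] and P[x, u]."""
    return np.einsum("xuy,xu->xy", K, P)

def invariant_distribution(K_P):
    """Solve pi K_P = pi together with sum(pi) = 1 in the least-squares sense."""
    n = K_P.shape[0]
    A = np.vstack([K_P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    pi, *_ = np.linalg.lstsq(A, b, rcond=None)
    return pi

# Hypothetical MDP with 2 states and 2 actions.
K = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
P = np.array([[1.0, 0.0],   # in state 0 always take action 0
              [0.0, 1.0]])  # in state 1 always take action 1

K_P = induced_kernel(K, P)        # [[0.9, 0.1], [0.1, 0.9]]
pi_P = invariant_distribution(K_P)
joint = pi_P[:, None] * P         # pi_P (x) P as an array indexed by (x, u)
```

Here the induced chain is symmetric, so the invariant distribution puts mass 1/2 on each state.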

The total variation (or $L_1$) distance between $\nu_1, \nu_2 \in \mathcal{P}(\mathsf{U})$ is

$$ \|\nu_1 - \nu_2\|_1 = \sum_{u \in \mathsf{U}} |\nu_1(u) - \nu_2(u)|. $$

It admits the following variational representation:

$$ \|\nu_1 - \nu_2\|_1 = \sup_{f : \|f\|_\infty \le 1} |\langle \nu_1, f \rangle - \langle \nu_2, f \rangle|, \tag{1} $$

where the supremum is over all functions $f : \mathsf{U} \to \mathbb{R}$ with absolute value bounded
by 1, and we are using the linear functional notation for expectations:

$$ \langle \nu, f \rangle = \mathbb{E}_\nu[f] = \sum_{u \in \mathsf{U}} \nu(u) f(u). $$
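The variational representation (1) is easy to check numerically: the supremum is attained by the sign of the difference of the two distributions. The distributions and function names below are hypothetical and serve only as a sanity check.

```python
import numpy as np

def l1_distance(nu1, nu2):
    """Sum over actions of |nu1(u) - nu2(u)|."""
    return float(np.abs(nu1 - nu2).sum())

def functional_gap(nu1, nu2, f):
    """|<nu1, f> - <nu2, f>| for a function f given as a vector indexed by actions."""
    return float(abs(nu1 @ f - nu2 @ f))

nu1 = np.array([0.5, 0.3, 0.2])
nu2 = np.array([0.2, 0.3, 0.5])

f_star = np.sign(nu1 - nu2)                 # maximizer with sup norm at most 1
distance = l1_distance(nu1, nu2)            # 0.3 + 0.0 + 0.3 = 0.6
attained = functional_gap(nu1, nu2, f_star)

# Any other f bounded by 1 in absolute value gives a gap no larger than the distance.
rng = np.random.default_rng(0)
random_gaps = [functional_gap(nu1, nu2, rng.uniform(-1, 1, size=3)) for _ in range(100)]
```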

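The Kullback-Leibler divergence (or relative entropy) between $\nu_1$ and $\nu_2$ [8] is

$$ D(\nu_1 \| \nu_2) = \begin{cases} \sum_{u \in \mathsf{U}} \nu_1(u) \log \dfrac{\nu_1(u)}{\nu_2(u)} & \text{if } \operatorname{supp}(\nu_1) \subseteq \operatorname{supp}(\nu_2) \\ +\infty & \text{otherwise} \end{cases} $$

where $\operatorname{supp}(\nu) = \{u \in \mathsf{U} : \nu(u) > 0\}$ is the support of $\nu$. Here and
in the sequel, we work with natural logarithms. The same applies, mutatis mutandis, to probability
distributions on $\mathsf{X}$.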
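A direct implementation of this definition has to treat the support condition explicitly. The following sketch uses natural logarithms as in the text; the function name and test vectors are illustrative.

```python
import numpy as np

def kl_divergence(nu1, nu2):
    """D(nu1 || nu2) with natural logarithms; +inf unless supp(nu1) is inside supp(nu2)."""
    nu1 = np.asarray(nu1, dtype=float)
    nu2 = np.asarray(nu2, dtype=float)
    support = nu1 > 0                   # terms with nu1(u) = 0 contribute nothing
    if np.any(nu2[support] == 0):
        return float("inf")             # support condition violated
    return float(np.sum(nu1[support] * np.log(nu1[support] / nu2[support])))

d_same = kl_divergence([0.5, 0.5], [0.5, 0.5])    # 0.0
d_point = kl_divergence([1.0, 0.0], [0.5, 0.5])   # log 2
d_inf = kl_divergence([0.5, 0.5], [1.0, 0.0])     # +inf
```

The last two calls show that the divergence is not symmetric: swapping the arguments turns a finite value into an infinite one.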

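We will also be dealing with binary trees that arise in symmetrization arguments, as in [27]: Let
$\mathcal{H}$ be an arbitrary set. An $\mathcal{H}$-valued tree $h$ of depth $d$ is defined as a
sequence $(h_1, \ldots, h_d)$ of mappings $h_t : \{\pm 1\}^{t-1} \to \mathcal{H}$ for
$t = 1, 2, \ldots, d$. Given a tuple
$\varepsilon = (\varepsilon_1, \ldots, \varepsilon_d) \in \{\pm 1\}^d$, we will often write
$h_t(\varepsilon)$ instead of $h_t(\varepsilon_{1:t-1})$.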
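One way to picture such a tree in code is as a list of functions, where the function at level t looks only at the first t-1 signs of the path. The labeling rule (sum of the prefix) and the helper name `make_tree` are hypothetical and chosen purely for illustration.

```python
def make_tree(depth, label):
    """Return [h_1, ..., h_d], where h_t(eps) = label(first t-1 signs of eps)."""
    return [lambda eps, t=t: label(tuple(eps[: t - 1])) for t in range(1, depth + 1)]

# Integer-valued tree of depth 3 whose node label is the sum of the sign prefix.
tree = make_tree(3, lambda prefix: sum(prefix))

eps = (1, -1, 1)
path_values = [h(eps) for h in tree]  # labels met along the path: [0, 1, 0]

# A tree of depth d has 2^(t-1) nodes at level t, hence 2^d - 1 nodes in total.
num_nodes = sum(2 ** (t - 1) for t in range(1, 4))
```

Writing `h(eps)` with the full sign tuple, as in the shorthand above, is harmless because level t ignores every coordinate from position t onward.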


3  Problem  formulation 


We consider an online MDP with finite state and action spaces $\mathsf{X}$ and $\mathsf{U}$ and
transition kernel $K(y|x,u)$. Let $\mathcal{F}$ be a fixed class of functions
$f : \mathsf{X} \times \mathsf{U} \to \mathbb{R}$, and let $x \in \mathsf{X}$ be a fixed initial state.
Consider an agent performing a controlled random walk on $\mathsf{X}$ in response to signals coming from
the environment. The agent is using mixed strategies to choose actions, where a mixed strategy is a
probability distribution over the action space. The interaction between the agent and the environment
proceeds as follows:

$X_1 = x$
for $t = 1, 2, \ldots, T$
    The agent observes the state $X_t$ and selects a mixed strategy $P_t \in \mathcal{P}(\mathsf{U})$
    The environment selects $f_t \in \mathcal{F}$ and announces it to the agent
    The agent draws an action $U_t$ from $P_t$ and incurs one-step cost $f_t(X_t, U_t)$
    The system transitions to the next state $X_{t+1} \sim K(\cdot|X_t, U_t)$
end for
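The protocol above can be simulated in a few lines. In this sketch the oblivious environment is represented by a list of cost arrays fixed in advance, the kernel is stored as `K[x, u, y]`, and the agent's rule is a callable `strategy(x, past_costs)`; these representations and all names are assumptions made for illustration.

```python
import numpy as np

def simulate(K, costs, strategy, x0, rng):
    """Run the agent-environment protocol once and return the total incurred cost.

    K[x, u, y] : probability of moving to state y from state x under action u
    costs      : list of arrays f_t[x, u], fixed in advance (oblivious environment)
    strategy   : maps (current state, list of past cost arrays) to a distribution on actions
    """
    x, total, past = x0, 0.0, []
    for f_t in costs:
        P_t = strategy(x, past)                     # agent observes X_t, picks P_t
        u = rng.choice(len(P_t), p=P_t)             # agent draws U_t from P_t
        total += f_t[x, u]                          # agent incurs f_t(X_t, U_t)
        past.append(f_t)                            # f_t is now known to the agent
        x = rng.choice(K.shape[2], p=K[x, u])       # X_{t+1} ~ K(.|X_t, U_t)
    return total

# Hypothetical example: 2 states, 2 actions, unit costs, uniformly random agent.
K = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
costs = [np.ones((2, 2)) for _ in range(5)]
uniform = lambda x, past: np.array([0.5, 0.5])
total_cost = simulate(K, costs, uniform, x0=0, rng=np.random.default_rng(0))
```

The strategy is called before the current cost array is appended to the history, so the choice at time t can depend only on the current state and on costs from earlier rounds.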

Here, $T$ is a fixed finite horizon. We assume throughout that the environment is oblivious (or
open-loop), in the sense that the evolution of the sequence $\{f_t\}$ is not affected by the state and
action sequences $\{X_t\}$ and $\{U_t\}$. We view the above process as a two-player repeated game
between the agent and the environment. At each $t \ge 1$, the process is at state $X_t = x_t$. The agent
observes the current state $x_t$ and selects the mixed strategy $P_t$, where
$P_t(u|x_t) = \Pr\{U_t = u \,|\, X_t = x_t\}$, based on his knowledge of all the previous states and the
current state, $x^t = (x_1, \ldots, x_t)$, and of the previous moves of the environment,
$f^{t-1} = (f_1, \ldots, f_{t-1})$. After drawing the action $U_t$ from $P_t$, the agent incurs the
one-step cost $f_t(X_t, U_t)$. Adopting game-theoretic terminology [4], we define the agent’s
closed-loop behavioral strategy as a tuple $\gamma = (\gamma_1, \ldots, \gamma_T)$, where
$\gamma_t : \mathsf{X}^t \times \mathcal{F}^{t-1} \to \mathcal{P}(\mathsf{U})$. Similarly, the
environment’s open-loop behavioral strategy is a tuple $f = (f_1, \ldots, f_T)$. Once the initial state
$X_1 = x$ and the strategy pair $(\gamma, f)$ are specified, the joint distribution of the state-action
process $(X^T, U^T)$ is well-defined.

Let $\mathcal{M}_0 = \mathcal{M}_0(\mathsf{U}|\mathsf{X}) \subseteq \mathcal{M}(\mathsf{U}|\mathsf{X})$
denote the subset of all Markov policies $P$ for which the induced state transition kernel
$K(\cdot|\cdot,P)$ has a unique invariant distribution $\pi_P \in \mathcal{P}(\mathsf{X})$. The goal of
the agent is to minimize the expected steady-state regret

$$ R_x^{\gamma,f} = \mathbb{E}_x^{\gamma,f}\left[ \sum_{t=1}^{T} f_t(X_t, U_t) \right] - \inf_{P \in \mathcal{M}_0} \mathbb{E}\left[ \sum_{t=1}^{T} f_t(X, U) \right], \tag{2} $$

where the outer expectation $\mathbb{E}_x^{\gamma,f}$ is taken w.r.t. the Markov chain induced by the
agent’s behavioral strategy $\gamma$ (including randomization of the agent’s actions), the
environment’s behavioral strategy $f$, and the initial state $X_1 = x$. The inner expectation (after the
infimum) is w.r.t. the state-action distribution $\pi_P \otimes P(x,u) = \pi_P(x) P(u|x)$, where
$\pi_P$ denotes the unique invariant distribution of $K(\cdot|\cdot,P)$. The regret can be interpreted
as the gap between the expected cumulative cost of the agent using strategy $\gamma$ and the best
steady-state cost the agent could have achieved in hindsight by using the best stationary policy
$P \in \mathcal{M}_0$ (with full knowledge of $f = f^T$). This gap arises through the agent’s lack of
prior knowledge of the sequence of cost functions.

Here we consider the steady-state regret, so that the expectation w.r.t. the state evolution in the
comparator term $\mathbb{E}\left[\sum_{t=1}^{T} f_t(X, U)\right]$ is taken over the invariant
distribution $\pi_P$ instead of the Markov transition law $K(\cdot|\cdot,P)$ induced by $P$. Under the
additional assumptions that the cost functions $f_t$ are uniformly bounded and the induced Markov chains
$K(\cdot|\cdot,P)$ are uniformly exponentially mixing for all
$P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$, the difference we introduce here by considering the steady
state is bounded by a constant independent of $T$ [9, 33], and so is negligible in the long run. In our
main results, we only consider baseline policies in $\mathcal{M}_0$ that are uniformly exponentially
mixing, so we restrict our attention to the steady-state regret without any loss of generality.
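For a given stationary policy, the comparator term in the steady-state regret is a finite sum that can be evaluated exactly. The sketch below does so and then searches over deterministic policies only; that restriction is an assumption made to keep the example short, and in general it yields an upper bound on the infimum over all of $\mathcal{M}_0$. The array layout `K[x, u, y]`, the toy costs, and the function names are likewise illustrative.

```python
from itertools import product
import numpy as np

def invariant_distribution(K_P):
    """Least-squares solve of pi K_P = pi, sum(pi) = 1 (assumes a unique solution)."""
    n = K_P.shape[0]
    A = np.vstack([K_P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def steady_state_cost(K, P, costs):
    """sum_t <pi_P (x) P, f_t> for a stationary policy P[x, u]."""
    K_P = np.einsum("xuy,xu->xy", K, P)
    joint = invariant_distribution(K_P)[:, None] * P
    return float(sum(np.sum(joint * f) for f in costs))

def best_deterministic_policy(K, costs):
    """Brute-force search over deterministic policies (an upper bound on the infimum)."""
    n_x, n_u, _ = K.shape
    best_value, best_policy = np.inf, None
    for policy in product(range(n_u), repeat=n_x):
        P = np.eye(n_u)[list(policy)]          # one-hot rows: state x plays policy[x]
        value = steady_state_cost(K, P, costs)
        if value < best_value:
            best_value, best_policy = value, policy
    return best_value, best_policy

# Hypothetical example: action 1 pushes the chain toward state 1; state 0 costs 1 per step.
K = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.9, 0.1], [0.1, 0.9]]])
costs = [np.array([[1.0, 1.0], [0.0, 0.0]]) for _ in range(10)]
value, policy = best_deterministic_policy(K, costs)
```

In this toy problem the best deterministic policy always plays action 1, spends a fraction 0.1 of the time in the costly state, and therefore has steady-state cost 1.0 over ten steps.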


3.1  Minimax  value 


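We start our analysis by studying the value of the game (the minimax regret), which we first write down
in strategic form as

$$ V(x) = \inf_{\gamma} \sup_{f} R_x^{\gamma,f} = \inf_{\gamma} \sup_{f} \mathbb{E}_x^{\gamma,f}\left[ \sum_{t=1}^{T} f_t(X_t, U_t) - \Psi(f) \right], \tag{3} $$

where we have introduced the shorthand $\Psi$ for the comparator term:

$$ \Psi(f) = \inf_{P \in \mathcal{M}_0} \mathbb{E}\left[ \sum_{t=1}^{T} f_t(X, U) \right]. $$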

In operational terms, $V(x)$ gives the best value of the regret the agent can secure by any closed-loop
behavioral strategy against the worst-case choice of an open-loop behavioral strategy of the
environment. However, the strategic form of the value hides the timing protocol of the game, which
encodes the information available to the agent at each time step. To that end, we give the following
equivalent expression of $V(x)$ in extensive form:


Proposition 1  The minimax value (3) is given by

$$ V(x) = \inf_{P_1} \sup_{f_1} \cdots \inf_{P_T} \sup_{f_T} \mathbb{E}\left[ \sum_{t=1}^{T} f_t(X_t, U_t) - \Psi(f) \right]. \tag{4} $$

Proof: See Appendix A.  ■

From this minimax formulation, we can immediately get an optimal algorithm that attains the minimax
value. To see this, we give an equivalent recursive form for the value of the game. For any
$t \in \{0, 1, \ldots, T-1\}$, any given prefix $f^t = (f_1, \ldots, f_t)$ (where we let $f^0$ be the
empty tuple $e$), and any state $X_{t+1} = x$, define the conditional value

$$ V_t(x, f^t) = \inf_{\nu \in \mathcal{P}(\mathsf{U})} \sup_{f} \left\{ \sum_{u \in \mathsf{U}} f(x,u) \nu(u) + \mathbb{E}\left[ V_{t+1}(Y, f_1, \ldots, f_t, f) \,\middle|\, x, \nu \right] \right\}, \qquad t = T-1, \ldots, 0 \tag{5a} $$

$$ V_T(x, f^T) = -\Psi(f). \tag{5b} $$

Remark 1  Recursive decompositions of this sort arise frequently in problems involving decision-making
in the presence of uncertainty. For instance, we may view (5) as a dynamic program for a finite-horizon
minimax control problem [6]. Alternatively, we can think of (5) as applying the Shapley operator [31] to
the conditional value in a two-player stochastic game, where one player controls only the state
transitions, while the other player specifies the cost function. A promising direction for future work
is to derive some characteristics of the conditional value from analytical properties of the Shapley
operator.
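To see why the recursion (5) is expensive, it can be evaluated by brute force on a tiny instance. The sketch below is an approximation under several assumptions that are not part of the report: the cost class is a short finite list, there are exactly two actions so that the mixed strategy can be searched on a grid, and the comparator term is approximated by a search over deterministic policies. All names are hypothetical.

```python
from functools import lru_cache
from itertools import product
import numpy as np

def invariant_distribution(K_P):
    """Least-squares solve of pi K_P = pi, sum(pi) = 1 (assumes a unique solution)."""
    n = K_P.shape[0]
    A = np.vstack([K_P.T - np.eye(n), np.ones((1, n))])
    b = np.concatenate([np.zeros(n), [1.0]])
    return np.linalg.lstsq(A, b, rcond=None)[0]

def minimax_value(K, F, T, x0, grid_size=11):
    """Grid approximation of V_0(x0, e) for a finite cost class F and two actions."""
    n_x, n_u, _ = K.shape
    assert n_u == 2, "the grid over mixed strategies assumes two actions"
    grid = np.linspace(0.0, 1.0, grid_size)
    policies = list(product(range(n_u), repeat=n_x))

    def psi(prefix):
        # Comparator term, restricted to deterministic policies (an approximation).
        total = sum(F[i] for i in prefix)
        best = np.inf
        for policy in policies:
            P = np.eye(n_u)[list(policy)]
            pi = invariant_distribution(np.einsum("xuy,xu->xy", K, P))
            best = min(best, float(np.sum(pi[:, None] * P * total)))
        return best

    @lru_cache(maxsize=None)
    def V(t, x, prefix):
        if t == T:
            return -psi(prefix)                       # terminal condition (5b)
        best = np.inf
        for q in grid:                                # inf over mixed strategies nu
            nu = np.array([q, 1.0 - q])
            worst = max(                              # sup over the next cost function
                nu @ F[i][x]
                + sum((nu @ K[x, :, y]) * V(t + 1, y, prefix + (i,)) for y in range(n_x))
                for i in range(len(F))
            )
            best = min(best, worst)
        return best

    return V(0, x0, ())

K = np.array([[[0.9, 0.1], [0.2, 0.8]],
              [[0.5, 0.5], [0.1, 0.9]]])
constant = np.full((2, 2), 0.5)
other = np.array([[1.0, 0.0], [0.0, 1.0]])
v_constant = minimax_value(K, [constant], T=3, x0=0)
v_both = minimax_value(K, [constant, other], T=3, x0=0)
```

Even with memoization, the number of distinct prefixes grows like the size of the cost class raised to the power T, which is the exponential blow-up that motivates replacing the conditional value by computable bounds. With a single constant cost function the value is zero, and enlarging the environment's choices can only increase it.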

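From Proposition 1, we see that $V(x) = V_0(x, e)$. Moreover, we can immediately write down the
minimax-optimal behavioral strategy for the agent:

$$ \gamma_{t+1}(x, f^t) = \operatorname*{arg\,min}_{\nu \in \mathcal{P}(\mathsf{U})} \sup_{f} \left\{ \sum_{u \in \mathsf{U}} f(x,u) \nu(u) + \mathbb{E}\left[ V_{t+1}(Y, f_1, \ldots, f_t, f) \,\middle|\, x, \nu \right] \right\}, \qquad t = 0, \ldots, T-1. $$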


Note that the expression being minimized is a supremum of affine functions of $\nu$, so it is a
lower-semicontinuous function of $\nu$. Any lower-semicontinuous function achieves its infimum on a
compact set. Since the probability simplex $\mathcal{P}(\mathsf{U})$ is compact, we are assured that a
minimizing $\nu$ always exists. Using the above strategy at each time step, we can secure the minimax
value in the worst-case scenario. Note also that this strategy is very intuitive: it balances the
tendency to minimize the present cost against the risk of incurring high future costs. However, with all
the future infimum and supremum pairs involved, computing this conditional value is intractable. As a
result, the minimax optimal strategy is not computationally feasible. We address this challenge by
developing computable bounds on the conditional value functions and choosing a strategy that minimizes
these bounds, which gives a near-optimal strategy when the bounds are tight. In general, tighter bounds
yield lower regret while looser bounds are easier to compute, and various online MDP methods occupy
different points along this trade-off.

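In the spirit of [27], we come up with approximations of the conditional value $V_t(x, f^t)$ in (5). We say
that a sequence of functions $\widehat{V}_t : \mathsf{X} \times \mathcal{F}^t \to \mathbb{R}$ is an admissible relaxation if

$$
\widehat{V}_t(x, f^t) \ge \inf_{\nu \in \mathcal{P}(\mathsf{U})} \sup_{f} \Big\{ \sum_{u \in \mathsf{U}} f(x,u)\,\nu(u) + \mathbb{E}\big[ \widehat{V}_{t+1}(Y, f_1, \ldots, f_t, f) \,\big|\, x, \nu \big] \Big\}, \qquad t = T-1, \ldots, 0 \tag{6a}
$$

$$
\widehat{V}_T(x, f^T) \ge -\Psi(f). \tag{6b}
$$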

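We can associate a behavioral strategy $\widehat{\gamma}$ to any admissible relaxation as follows:

$$
\widehat{\gamma}_t(x, f^{t-1}) = \arg\min_{\nu \in \mathcal{P}(\mathsf{U})} \sup_{f} \Big\{ \sum_{u \in \mathsf{U}} f(x,u)\,\nu(u) + \mathbb{E}\big[ \widehat{V}_t(Y, f_1, \ldots, f_{t-1}, f) \,\big|\, x, \nu \big] \Big\}.
$$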

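Proposition 2 Given an admissible relaxation $\{\widehat{V}_t\}_{t=0}^{T}$ and the associated behavioral strategy $\widehat{\gamma}$, for any
open-loop strategy of the environment we have the regret bound

$$
R_x^{\widehat{\gamma},f} = \mathbb{E}_x^{\widehat{\gamma},f}\Big[ \sum_{t=1}^{T} f_t(X_t, U_t) - \Psi(f) \Big] \le \widehat{V}_0(x).
$$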


Proof:  See  Appendix  B.  ■ 

Based on the above sequential decompositions, it suffices to restrict attention only to Markov strategies
for the agent, i.e., sequences of mappings $\gamma_t : \mathsf{X} \times \mathcal{F}^{t-1} \to \mathcal{P}(\mathsf{U})$ for all $t$, so that $U_t$ is conditionally
independent of $X^{t-1}, U^{t-1}$ given $X_t, f^{t-1}$. From now on, we will just say “behavioral strategy” and really
mean “Markov behavioral strategy.” In other words, given $X_t, f^{t-1}$, the history of past states and actions
is irrelevant, as far as the value of the game is concerned.
Remark 2 What happens if the environment is nonoblivious? [33] gave a simple counterexample of an
aperiodic and recurrent MDP to show that the regret is linear in $T$ regardless of the agent’s policy when
the opponent can adapt to the agent’s state trajectory. We can gain additional insight into the challenges
associated with an adaptive environment from the perspective of the minimax value. In particular, an
adaptive environment’s closed-loop behavioral strategy is $\delta = (\delta_1, \ldots, \delta_T)$ with $\delta_t : \mathsf{X}^t \times \mathsf{U}^{t-1} \to \mathcal{F}$, and
the corresponding regret will be given by


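$$
\mathbb{E}^{\gamma,\delta}\Big[ \sum_{t=1}^{T} f_t(X_t, U_t) - \Psi(f) \Big]
\le \mathbb{E}^{\gamma,\delta}\Big[ \sum_{t=1}^{T} f_t(X_t, U_t) + \widehat{V}_T(X_{T+1}, f^T) \Big]
= \mathbb{E}^{\gamma,\delta}\Big[ \sum_{t=1}^{T-1} f_t(X_t, U_t) \Big] + \mathbb{E}^{\gamma,\delta}\big[ f_T(X_T, U_T) + \widehat{V}_T(X_{T+1}, f^T) \big].
$$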


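Let us analyze the last two terms:

$$
\mathbb{E}^{\gamma,\delta}\big[ f_T(X_T, U_T) + \widehat{V}_T(X_{T+1}, f^T) \big]
= \int_{\mathsf{X}^T \times \mathcal{F}^T} P(dx^T, df^T) \int_{\mathsf{U}} P(du_T \mid x_T, f^{T-1}) \Big\{ f_T(x_T, u_T) + \mathbb{E}\big[ \widehat{V}_T(X_{T+1}, f^T) \,\big|\, x^T, u_T \big] \Big\}.
$$

In the above conditional expectation, $f$ may depend on the entire $x^T$, so we cannot replace this conditional
expectation by $\mathbb{E}[\,\cdot \mid x_T, \gamma_T(x_T)]$. This implies we cannot get similar results as in Proposition 2 in a
fully adaptive environment.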


3.2  Major  challenges 

From Proposition 2, we can see that we can bound the expected steady-state regret in terms of the chosen
relaxation. If we construct an admissible relaxation by deriving certain upper bounds on the conditional
value and implement the associated behavioral strategy, we obtain an algorithm that achieves
the regret bound corresponding to the relaxation. In principle, this gives us a general framework for
developing low-regret algorithms for online MDPs. However, with an additional state variable involved, it is
difficult to derive admissible relaxations $\widehat{V}_t(x, f^t)$ to bound the conditional value. The difficulty stems
from the fact that the current cost now depends not only on the current action, but also on past actions.
Our plan is to first reduce this setting to a simpler one in which there are no Markov dynamics
and the agent only has to choose actions. We will then be able to incorporate the ideas
of [26, 27] into this simpler setting. More specifically, using the Rademacher complexity tools introduced
by [26, 27], we can derive algorithms in the simpler static setting and then transfer them to the original
problem. In the same vein, we will also prove a general regret bound for the derived algorithms. Thus
we will have a general recipe for developing algorithms and showing performance guarantees for online
MDPs.


4  The  general  framework  for  constructing  algorithms  in  online  MDPs 

As mentioned in the above section, the main challenge to overcome is the dependence of the conditional
value in (6) on the state variable. Our plan is to reduce the original online MDP problem to a simpler one,
where there are no Markov dynamics and the agent only has to choose actions.

We proceed with our plan in several steps. First, we introduce a stationarization technique that will
allow us to reduce the online MDP setting to a simpler setting without Markov dynamics. This effectively
decouples current costs from past actions. Note that this reduction is fundamentally different from
naively applying stateless online learning methods in an online MDP setting, which would amount to a
very poor stationarization strategy with larger errors and consequently larger regret bounds. In contrast,
our proposed stationarization performs the decoupling with minimal loss in accuracy by exploiting the
transition kernel, yielding lower regret bounds. We then state a new admissibility condition for relaxations
that differs from (6) in that there is no conditioning on the state variable. The advantage of working
with this new type of relaxation is that the corresponding admissibility conditions are much easier
to verify. The main result of this section is that we can apply any algorithm derived in the simpler static
setting to the original dynamic setting and automatically bound its regret.

4.1  Stationarization 

Our stationarization technique makes use of Poisson inequalities for MDPs [22] to bound the regret
defined in (2) in terms of a different function as opposed to the one-step cost function $f$.

As before, we let $K$ denote the fixed and known transition law of the MDP. Following [9] and [33], we
assume that $K$ is a unichain model, i.e., $K(\cdot|\cdot, P)$ is unichain for any choice of $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$ (see
Section 2.2 for definitions). Thus, every state feedback law $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$ belongs to $\mathcal{M}_0$. For future reference,
we record the following crucial consequence of the unichain assumption: There exists a finite constant
$\tau > 0$ such that for all Markov policies $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$ and all distributions $\mu_1, \mu_2 \in \mathcal{P}(\mathsf{X})$,

$$
\| \mu_1 K(\cdot|P) - \mu_2 K(\cdot|P) \|_1 \le e^{-1/\tau} \| \mu_1 - \mu_2 \|_1, \tag{7}
$$

where $K(\cdot|P) \in \mathcal{M}(\mathsf{X}|\mathsf{X})$ is the Markov matrix on the state space induced by $P$. In other words, the
collection of all state transition laws induced by all Markov policies $P$ is uniformly mixing. Here we assume
that $\tau \ge 1$.
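
As a rough numerical illustration of the uniform mixing condition (7), the following sketch computes a contraction constant for a small, made-up MDP. It relies on two assumptions that are not taken from the report: that the tightest one-step $\ell_1$ contraction factor of a Markov matrix is its Dobrushin coefficient (half the largest $\ell_1$ distance between two rows), and that this coefficient, being convex in $P$, is maximized over deterministic state feedback laws. The kernel `K` and all function names are hypothetical.

```python
import itertools
import math

# Hypothetical 2-state, 2-action transition law:
# K[x][u][y] = Pr(next state = y | current state x, action u).
K = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]

def induced_kernel(K, policy):
    """State-to-state Markov matrix under a deterministic law policy[x] = u."""
    return [K[x][policy[x]] for x in range(len(K))]

def contraction_coefficient(M):
    """Dobrushin coefficient: half the largest L1 distance between two rows."""
    n = len(M)
    return max(
        0.5 * sum(abs(a - b) for a, b in zip(M[i], M[j]))
        for i in range(n) for j in range(n)
    )

def mixing_constant(K):
    """Return (worst contraction factor, tau) with factor = exp(-1/tau)."""
    n_states, n_actions = len(K), len(K[0])
    worst = max(
        contraction_coefficient(induced_kernel(K, policy))
        for policy in itertools.product(range(n_actions), repeat=n_states)
    )
    # Solving worst = exp(-1/tau) requires 0 < worst < 1.
    return worst, -1.0 / math.log(worst)

worst, tau = mixing_constant(K)
```

For this toy kernel the worst factor is attained by the two policies that pick different actions in the two states, and the resulting constant satisfies the standing assumption that it is at least 1.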

Remark 3 The unichain assumption is rather strong, since it places significant simultaneous restrictions
on an exponentially large family of Markov chains on the state space (each chain corresponds to
a particular choice of a deterministic state feedback law, and there are $|\mathsf{U}|^{|\mathsf{X}|}$ such laws). It is also
difficult to verify, since the problem of determining whether an MDP is unichain is NP-hard [32]. [3] relax
the unichain assumption in a different way by considering deterministic state transition dynamics and
weakly communicating structure, under which it is possible to move from any state to any other state
under some policy. Although it is not yet clear whether positive results can be derived with stochastic state
transition dynamics and weakly communicating structure, imposing weaker assumptions on state connectivity
is a goal of our future work.

Consider now a behavioral strategy $\gamma = (\gamma_1, \ldots, \gamma_T)$ for the agent. For a given choice $f = (f_1, \ldots, f_T)$ of
costs, the following objects are well-defined:

• $P_t^{\gamma,f} \in \mathcal{M}(\mathsf{U}|\mathsf{X})$: the Markov matrix that governs the conditional distribution of $U_t$ given $X_t$, i.e.,
$P_t^{\gamma,f}(u|x) = [\gamma_t(x, f^{t-1})](u)$;

• $\mu_t^{\gamma,f} \in \mathcal{P}(\mathsf{X})$: the distribution of $X_t$;

• $K_t^{\gamma,f} \in \mathcal{M}(\mathsf{X}|\mathsf{X})$: the Markov matrix that describes the state transition from $X_t$ to $X_{t+1}$, i.e.,
$K_t^{\gamma,f}(y|x) = K(y|x, P_t^{\gamma,f}) = \sum_{u} K(y|x, u)\, P_t^{\gamma,f}(u|x)$;

• $\pi_t^{\gamma,f} \in \mathcal{P}(\mathsf{X})$: the unique stationary distribution of $K_t^{\gamma,f}$, satisfying $\pi_t^{\gamma,f} = \pi_t^{\gamma,f} K_t^{\gamma,f}$, where
existence and uniqueness are guaranteed by virtue of the unichain assumption;

• $\eta_t^{\gamma,f} = \langle \pi_t^{\gamma,f} \otimes P_t^{\gamma,f}, f_t \rangle$: the steady-state cost at time $t$.

Moreover, for any other state feedback law $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$, we will denote by $\eta_t^{P,f}$ the steady-state cost
$\langle \pi_P \otimes P, f_t \rangle$, where $\pi_P$ is the unique invariant distribution of $K(\cdot|\cdot, P)$.
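
The objects just listed are straightforward to compute for a small finite MDP. The sketch below is only an illustration under stated assumptions: the kernel `K`, the cost `f`, the feedback laws, and the use of plain power iteration to approximate the stationary distribution are all choices made here, not taken from the report.

```python
# Hypothetical 2-state, 2-action transition law: K[x][u][y].
K = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]

def induced_kernel(K, P):
    """K_P(y|x) = sum_u K(y|x,u) P(u|x) for a randomized law P[x][u]."""
    n, m = len(K), len(K[0])
    return [[sum(P[x][u] * K[x][u][y] for u in range(m)) for y in range(n)]
            for x in range(n)]

def stationary(M, iters=2000):
    """Approximate the invariant distribution pi = pi M by power iteration."""
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * M[x][y] for x in range(n)) for y in range(n)]
    return pi

def steady_state_cost(K, P, f):
    """eta = <pi_P (x) P, f> = sum_x pi_P(x) sum_u P(u|x) f(x,u)."""
    pi = stationary(induced_kernel(K, P))
    return sum(pi[x] * P[x][u] * f[x][u]
               for x in range(len(K)) for u in range(len(K[0])))

f = [[1.0, 0.0], [0.0, 1.0]]          # hypothetical one-step cost f(x,u)
P_uniform = [[0.5, 0.5], [0.5, 0.5]]  # randomized state feedback law
P_det = [[1.0, 0.0], [0.0, 1.0]]      # action 0 in state 0, action 1 in state 1

pi_uniform = stationary(induced_kernel(K, P_uniform))
eta_uniform = steady_state_cost(K, P_uniform, f)
pi_det = stationary(induced_kernel(K, P_det))
```

Power iteration is adequate here because the toy chains mix quickly; for larger or slowly mixing chains one would more likely solve the linear system for the invariant distribution directly.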

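It will be convenient to introduce the regret w.r.t. a fixed $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$ with initial state $X_1 = x$:

$$
R_x^{\gamma,f}(P) = \mathbb{E}_x^{\gamma,f}\Big[ \sum_{t=1}^{T} f_t(X_t, U_t) - \sum_{t=1}^{T} \eta_t^{P,f} \Big]
= \sum_{t=1}^{T} \Big[ \langle \mu_t^{\gamma,f} \otimes P_t^{\gamma,f}, f_t \rangle - \langle \pi_P \otimes P, f_t \rangle \Big]
$$

as well as the stationarized regret

$$
\bar{R}^{\gamma,f}(P) = \sum_{t=1}^{T} \Big[ \eta_t^{\gamma,f} - \eta_t^{P,f} \Big]
= \sum_{t=1}^{T} \Big[ \langle \pi_t^{\gamma,f} \otimes P_t^{\gamma,f}, f_t \rangle - \langle \pi_P \otimes P, f_t \rangle \Big].
$$

Using (1), we get the bound

$$
R_x^{\gamma,f}(P) \le \bar{R}^{\gamma,f}(P) + \sum_{t=1}^{T} \| f_t \|_\infty \, \| \mu_t^{\gamma,f} - \pi_t^{\gamma,f} \|_1. \tag{8}
$$
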


The key observation here is that the task of analyzing the regret $R_x^{\gamma,f}(P)$ splits into separately
upper-bounding the two terms on the right-hand side of (8): the stationarized regret $\bar{R}^{\gamma,f}(P)$ and the
stationarization error $\sum_{t=1}^{T} \| f_t \|_\infty \| \mu_t^{\gamma,f} - \pi_t^{\gamma,f} \|_1$ using Markov chain techniques.

In order to analyze the stationarized regret, we introduce the reverse Poisson inequality. Fix a Markov
matrix $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$ and let $\pi_P \in \mathcal{P}(\mathsf{X})$ be the (unique) invariant distribution of $K(\cdot|\cdot, P)$. Then we say that
$Q : \mathsf{X} \times \mathsf{U} \to \mathbb{R}$ satisfies the reverse Poisson inequality with forcing function $g : \mathsf{X} \times \mathsf{U} \to \mathbb{R}$ if

$$
\mathbb{E}\big[ Q(Y, P) \,\big|\, x, u \big] - Q(x, u) \ge -g(x, u) + \langle \pi_P \otimes P, g \rangle, \qquad \forall (x, u) \in \mathsf{X} \times \mathsf{U} \tag{9}
$$

where

$$
Q(y, P) \triangleq \sum_{u \in \mathsf{U}} P(u|y)\, Q(y, u)
$$

and $\mathbb{E}[\,\cdot\,|x, u]$ is w.r.t. the transition law $K(y|x, u)$. We should think of this as a relaxation of the Poisson
equation [22], i.e., the case when (9) holds with equality. The Poisson equation arises naturally in the theory of
Markov chains and Markov decision processes, where it provides a way to evaluate the long-term average
cost along the trajectory of a Markov process. We are using the term “reverse Poisson inequality” to
distinguish (9) from the Poisson inequality, which also arises in the theory of Markov chains and is obtained
by replacing $\ge$ with $\le$ in (9) [22]. Here we impose the following assumption that we use throughout the
rest of the report:

Assumption 1 For any $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$ and any $f \in \mathcal{F}$, there exists some $Q_{P,f} : \mathsf{X} \times \mathsf{U} \to \mathbb{R}$ that solves the
reverse Poisson inequality for $P$ with forcing function $f$. Moreover,

$$
L(\mathsf{X}, \mathsf{U}, \mathcal{F}) \triangleq \sup_{P \in \mathcal{M}(\mathsf{U}|\mathsf{X})} \sup_{f \in \mathcal{F}} \| Q_{P,f} \|_\infty < \infty.
$$

Remark 4 In Section 5, we will show this assumption is automatically satisfied under an additional
condition of uniform ergodicity.

The  main  consequence  of  the  reverse  Poisson  inequality  is  the  following: 

Lemma 1 (Comparison principle) Suppose that $Q$ satisfies the reverse Poisson inequality (9) with forcing
function $g$. Then for any other Markov matrix $P'$ we have

$$
\langle \pi_P \otimes P, g \rangle - \langle \pi_{P'} \otimes P', g \rangle \le \sum_{x} \pi_{P'}(x) \sum_{u} \big[ P(u|x)\, Q(x, u) - P'(u|x)\, Q(x, u) \big].
$$


Proof:  See  Appendix  C.  ■ 
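
A small numerical check can make the comparison principle concrete. The sketch below is an illustration only: it assumes a hypothetical 2-state, 2-action MDP, and it takes $Q$ to be the solution of the Poisson equation (the equality case of (9)), computed by a truncated fixed-point iteration, in which case both sides of the lemma should coincide up to numerical error. All names and numbers are made up for the example.

```python
# Hypothetical transition law K[x][u][y] and forcing function g[x][u].
K = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]
g = [[1.0, 0.0], [0.0, 2.0]]

def induced_kernel(K, P):
    n, m = len(K), len(K[0])
    return [[sum(P[x][u] * K[x][u][y] for u in range(m)) for y in range(n)]
            for x in range(n)]

def stationary(M, iters=2000):
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * M[x][y] for x in range(n)) for y in range(n)]
    return pi

def steady_state_cost(K, P, g):
    pi = stationary(induced_kernel(K, P))
    return sum(pi[x] * P[x][u] * g[x][u]
               for x in range(len(K)) for u in range(len(K[0])))

def q_function(K, P, g, iters=2000):
    """Solve Q(x,u) = g(x,u) - eta + E[Q(Y,P) | x,u] by iterating on Q(.,P)."""
    n, m = len(K), len(K[0])
    eta = steady_state_cost(K, P, g)
    M = induced_kernel(K, P)
    g_P = [sum(P[x][u] * g[x][u] for u in range(m)) for x in range(n)]
    v = [0.0] * n  # v(y) approximates Q(y, P)
    for _ in range(iters):
        v = [g_P[x] - eta + sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
    return [[g[x][u] - eta + sum(K[x][u][y] * v[y] for y in range(n))
             for u in range(m)] for x in range(n)]

P = [[1.0, 0.0], [0.0, 1.0]]        # law for which Q is computed
P_other = [[0.5, 0.5], [0.5, 0.5]]  # comparison law P'

Q = q_function(K, P, g)
pi_other = stationary(induced_kernel(K, P_other))
lhs = steady_state_cost(K, P, g) - steady_state_cost(K, P_other, g)
rhs = sum(pi_other[x] * (P[x][u] - P_other[x][u]) * Q[x][u]
          for x in range(2) for u in range(2))
```

With a $Q$ that satisfies (9) only as a strict inequality, one would expect `lhs <= rhs` rather than equality.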

Armed with this lemma, we can now analyze the stationarized regret $\bar{R}^{\gamma,f}(P)$: suppose that, for each $t$,
$Q_t^{\gamma,f}$ satisfies the reverse Poisson inequality for $P_t^{\gamma,f}$ with forcing function $f_t$. Then we apply the comparison
principle to get

$$
\eta_t^{\gamma,f} - \eta_t^{P,f} \le \sum_{x} \pi_P(x) \sum_{u} \big[ P_t^{\gamma,f}(u|x)\, Q_t^{\gamma,f}(x, u) - P(u|x)\, Q_t^{\gamma,f}(x, u) \big].
$$

This in turn yields

$$
R_x^{\gamma,f}(P) \le \sum_{x} \pi_P(x) \sum_{t=1}^{T} \Big( \sum_{u} \big[ P_t^{\gamma,f}(u|x)\, Q_t^{\gamma,f}(x, u) - P(u|x)\, Q_t^{\gamma,f}(x, u) \big] \Big) + \sum_{t=1}^{T} \| f_t \|_\infty \, \| \mu_t^{\gamma,f} - \pi_t^{\gamma,f} \|_1.
$$

Note that $Q_t^{\gamma,f}$ depends functionally on $P_t^{\gamma,f}$ and on $f_t$, which in turn depend functionally on $f^t$ but
not on $f_{t+1}, \ldots, f_T$. This ensures that any algorithm using $Q_t^{\gamma,f}$ respects the causality constraint that any
decision made at time $t$ depends only on information available by time $t$.

Focusing on stationarized regret and upper-bounding it in terms of the Q-functions is one of the key
steps that lets us consider a simpler setting without Markov dynamics. The next step is to define a new
type of relaxation with an accompanying new admissibility condition for this simpler setting. That is, we
will find a relaxation and admissibility condition for the stationarized regret rather than for the expected
steady-state regret directly. A new admissibility condition is needed because we have decoupled current
costs from past actions, which makes the previous admissibility condition (6) inapplicable. The new
admissibility condition is similar to the one in [27], which was derived in a stateless setting. The difference
is that we are still in a state-dependent setting in the sense that the new type of relaxation is indexed by
the state variable. Now, instead of having Markov dynamics that depend on the state, we consider
all the states in parallel and have a separate algorithm running on each state. The interaction between
different states is generated by providing these algorithms with common information that comes from
the actual dynamical process. Thus, starting from this new admissibility condition, we further construct
algorithms using relaxations and then use Lemma 1 to bound the regret of these algorithms.


4.2  A  new  admissibility  condition  and  the  main  result 

Now we are in a position to pass to a simpler setting without Markov dynamics. Instead, we associate
each state with a separate game. Within each game, the agent chooses an action and observes a signal
from the environment, and the current cost in each state is independent of the past actions taken in
that state. The signal generated by the environment is the Q-function mentioned above in Assumption 1.
Although here we do not use the one-step cost functions $f_t$ as the signal, we know that the Q-functions
actually contain payoff-relevant information on $f_t$. From this perspective, letting the environment choose
one-step cost functions $f_t$ is equivalent to letting the environment choose the corresponding Q-functions.

To proceed, we need to introduce a new type of relaxation with a new admissibility condition. The
reason is that the relaxation $\widehat{V}$ defined in (6) is a sequence of functions with a state variable, and the state
is changing at every time step according to the state transition dynamics and the agent’s action. If we
view the interaction between the agent and the environment as a stochastic game, then the relaxation
is indeed a sequence of upper bounds on the conditional values of this game. However, after stationarization,
we end up with a family of relaxations indexed by the state variable: for each state, we have
a separate online learning game, and we have a separate relaxation for each of these online learning
games. Consequently, there are no Markov dynamics involved in any of the new relaxations. The new
relaxation at each state $x \in \mathsf{X}$, which we will denote by $\{W_{x,t}\}_{t=1}^{T}$, is a sequence of upper bounds on the
conditional value of the corresponding online learning game. We define such a relaxation as follows.

For each $x \in \mathsf{X}$, let $\mathcal{H}_x$ denote the class of all functions $h_x : \mathsf{U} \to \mathbb{R}$ for which there exist some
$P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$ and $f \in \mathcal{F}$ such that

$$
h_x(u) = Q_{P,f}(x, u), \qquad \forall u \in \mathsf{U}.
$$

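We say that a sequence of functions $W_{x,t} : \mathcal{H}_x^t \to \mathbb{R}$, $t = 0, \ldots, T$, is an admissible relaxation at state $x$ if the
following condition holds for any $h_{x,1}, \ldots, h_{x,T} \in \mathcal{H}_x$ (writing $h_x^t = (h_{x,1}, \ldots, h_{x,t})$):

$$
W_{x,T}(h_x^T) \ge -\inf_{\nu \in \mathcal{P}(\mathsf{U})} \mathbb{E}_{U \sim \nu}\Big[ \sum_{t=1}^{T} h_{x,t}(U) \Big] \tag{10a}
$$

$$
W_{x,t}(h_x^t) \ge \inf_{\nu \in \mathcal{P}(\mathsf{U})} \sup_{h_x \in \mathcal{H}_x} \Big\{ \mathbb{E}_{U \sim \nu}[h_x(U)] + W_{x,t+1}(h_x^t, h_x) \Big\}, \qquad t = T-1, \ldots, 0. \tag{10b}
$$
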


Given such an admissible relaxation, we can associate to it a behavioral strategy

$$
\widehat{\gamma}_t(x, f^{t-1}) = P_t^{\widehat{\gamma},f}(\cdot|x) = \arg\min_{\nu \in \mathcal{P}(\mathsf{U})} \sup_{h_x \in \mathcal{H}_x} \Big\{ \mathbb{E}_{U \sim \nu}[h_x(U)] + W_{x,t}(h_x^{t-1}, h_x) \Big\},
$$

$$
h_{y,t} = Q_t^{\widehat{\gamma},f}(y, \cdot), \qquad \forall y \in \mathsf{X}.
$$

(Even though the above notation suggests the dependence of $h_{y,t}$ on the $T$-tuples $\widehat{\gamma}$ and $f$, this dependence
at time $t$ is only w.r.t. $\widehat{\gamma}^t$ and $f^t$, so the resulting strategy is still causal.) The relaxation $\{W_{x,t}\}_{t=1}^{T}$
at state $x$ is a sequence of upper bounds on the conditional value of the online learning game associated
with that state. In this game, at time step $t$, the agent chooses an action $u_t \in \mathsf{U}$ and the environment
chooses a function $h_{x,t} \in \mathcal{H}_x$. Although this relaxation is still state-dependent, there are no Markov dynamics
involved here, which means that now the state-free techniques of [27] can be brought to bear on the
problem of constructing algorithms and bounding their regret. Specifically, we derive a separate relaxation
$\{W_{x,t}\}_{t=1}^{T}$ and the associated behavioral strategy for each state $x \in \mathsf{X}$. Then we assemble these into
an overall algorithm for the MDP as follows: if at time $t$ the state is $X_t = x$, the agent will choose actions
according to the corresponding behavioral strategy $\widehat{\gamma}_t(x, \cdot)$. Note that although the agent’s behavioral
strategy switches between different relaxations depending on the current state, the agent still needs to
update all the $h$-functions simultaneously for all the states. This is because the computation of the
$h$-functions (in terms of the Q-functions) requires knowledge of the behavioral strategy at other states.
In other words, the algorithm has to keep updating all the relaxations in parallel for all states.

Under  the  constructed  relaxation,  we  state  our  main  result: 


Theorem 1 Suppose that the MDP is unichain, the environment is oblivious, and Assumption 1 holds.
Then, for any family of admissible relaxations given by (10) and the corresponding behavioral strategy $\widehat{\gamma}$,
we have

$$
\mathbb{E}_x^{\widehat{\gamma},f}\Big[ \sum_{t=1}^{T} f_t(X_t, U_t) - \Psi(f) \Big] \le \sup_{P \in \mathcal{M}(\mathsf{U}|\mathsf{X})} \sum_{x} \pi_P(x)\, W_{x,0} + C_{\mathcal{F}} \sum_{t=1}^{T} \| \mu_t^{\widehat{\gamma},f} - \pi_t^{\widehat{\gamma},f} \|_1 \tag{11}
$$

where $C_{\mathcal{F}} = \sup_{f \in \mathcal{F}} \| f \|_\infty$.

Proof:  See  Appendix  D.  ■ 

This general framework gives us a recipe for deriving algorithms for online MDPs. First, we use
stationarization to pass to a simpler setting without Markov dynamics. Here we need to find functions $Q_t^{\widehat{\gamma},f}$
satisfying (9) with forcing function $f_t$ at each time $t$. In this simpler setting, we associate each state with
a separate online learning game. Next, we derive appropriate relaxations (upper bounds on the conditional
values) for each of these online learning games. Then we plug the relaxation into the admissibility
condition (10) to derive the associated algorithm. This algorithm in turn gives us a behavioral strategy for
the original online MDP problem, and Theorem 1 automatically gives us a regret bound for this strategy.
We emphasize that, in general, multiple different relaxations are possible for a given problem, allowing
for a flexible tradeoff between computational costs and regret.

We have reduced the original problem to a collection of standard online learning problems, each of
which is associated with a particular state. We proceed by constructing a separate relaxation for each of
these problems. Because we have removed the Markov dynamics, we may now use available techniques
for constructing these relaxations. In particular, as shown by [26], a versatile method for
constructing relaxations relies on the notion of sequential Rademacher complexity (SRC).


5  Example  derivations  of  explicit  algorithms  using  our  framework 

The algorithms derived using our general framework belong to the class of expert advice
algorithms [7]. An expert advice algorithm combines the recommendations of
several individual “experts” into a single strategy for choosing actions. Every expert is assigned a “weight”
indicating how much the agent trusts that expert, based on the previous performance of the experts. The
weights direct the agent with regard to which expert to follow at the next time step. The most popular
algorithm for the expert advice framework is the Randomized Weighted Majority (RWM) algorithm,
sometimes called Hedge. RWM maintains a set of weights over experts and updates these weights
multiplicatively. It also has an alternative interpretation from a regularization perspective: it has been shown that
the weights chosen by RWM minimize a combination of empirical cost and an entropic
regularization term. This observation makes RWM a special case of a broader class of
algorithms known as follow the regularized leader (FRL) algorithms [30]. The entropic regularizer adds
stability by favoring high-entropy, i.e., more uniform, weights. The algorithms we derive in this section are
examples in which RWM algorithms are applied to each state.

5.1 Recovering an expert-based algorithm for online MDPs

Similar to our set-up, [9] consider an MDP with arbitrarily varying cost functions. The main idea of their
work is to efficiently incorporate existing expert-based algorithms [7, 20] into the MDP setting. For an
MDP with state space $\mathsf{X}$ and action space $\mathsf{U}$, there are $|\mathsf{U}|^{|\mathsf{X}|}$ deterministic Markov policies (state feedback
laws), which renders the obvious approach of associating an expert with each possible deterministic
policy computationally infeasible. Instead, they propose an alternative efficient scheme that works by
associating a separate expert algorithm to each state, where experts correspond to actions and the feedback
provided to each expert algorithm depends on the aggregate policy determined by the action choices of
all the individual algorithms. Under a unichain assumption similar to the one we have made above, they
show that the expected regret of their algorithm is sublinear in $T$ and independent of the size of the state
space. Their algorithm can be summarized as follows:


Put in every state $x$ an expert algorithm $B_x$
for $t = 1, 2, \ldots$ do
    Let $P_t(\cdot|x_t)$ be the distribution over actions of $B_{x_t}$
    Use policy $P_t$ and obtain $f_t$ from the environment
    Feed $B_x$ with loss function $Q_{P_t,f_t}(x, \cdot) = \mathbb{E}\big[ \sum_{i=0}^{\infty} \big( f_t(x_i, u_i) - \eta_t^{P_t,f_t} \big) \big]$,
        where $\mathbb{E}$ is taken w.r.t. the Markov chain induced by $P_t$ from the initial state $x$, and
        $\eta_t^{P_t,f_t}$ is the steady-state cost $\langle \pi_{P_t} \otimes P_t, f_t \rangle$, where $\pi_{P_t}$ is the invariant distribution of $P_t$.
end for
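
The scheme above can be sketched in a few lines for a toy problem. This is a minimal illustration, not the implementation of [9]: it assumes that each per-state expert algorithm $B_x$ is an exponential weights learner with a hypothetical rate parameter `rho`, that the Q-function is obtained from the Poisson equation by fixed-point iteration, and that the kernel `K` and the cost sequence are made up for the example.

```python
import math

# Hypothetical 2-state, 2-action transition law K[x][u][y].
K = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]

def induced_kernel(K, P):
    n, m = len(K), len(K[0])
    return [[sum(P[x][u] * K[x][u][y] for u in range(m)) for y in range(n)]
            for x in range(n)]

def stationary(M, iters=500):
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * M[x][y] for x in range(n)) for y in range(n)]
    return pi

def q_function(K, P, f, iters=500):
    """Q_{P,f}(x,u): expected total cost in excess of the steady-state cost."""
    n, m = len(K), len(K[0])
    M = induced_kernel(K, P)
    pi = stationary(M)
    f_P = [sum(P[x][u] * f[x][u] for u in range(m)) for x in range(n)]
    eta = sum(pi[x] * f_P[x] for x in range(n))
    v = [0.0] * n
    for _ in range(iters):
        v = [f_P[x] - eta + sum(M[x][y] * v[y] for y in range(n)) for x in range(n)]
    return [[f[x][u] - eta + sum(K[x][u][y] * v[y] for y in range(n))
             for u in range(m)] for x in range(n)]

def run_per_state_experts(K, costs, rho):
    """One exponential-weights learner per state, each fed Q_{P_t,f_t}(x, .)."""
    n, m = len(K), len(K[0])
    P = [[1.0 / m] * m for _ in range(n)]  # every B_x starts from uniform weights
    for f in costs:                        # f_t is revealed after P_t is fixed
        Q = q_function(K, P, f)            # uses the aggregate policy P_t
        for x in range(n):                 # all states are updated, visited or not
            w = [P[x][u] * math.exp(-Q[x][u] / rho) for u in range(m)]
            z = sum(w)
            P[x] = [wu / z for wu in w]
    return P

# Constant cost that penalizes action 1 in every state (illustrative only).
costs = [[[0.0, 1.0], [0.0, 1.0]] for _ in range(50)]
P_final = run_per_state_experts(K, costs, rho=5.0)
```

Note that the Q-function is computed once per round from the full policy before any state is updated, which reflects the remark that the feedback to each per-state learner depends on the action choices of all the others.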


As we show next, the algorithm proposed by [9] arises from a particular relaxation of the kind that was
introduced in the preceding section. For every possible state value $x \in \mathsf{X}$, we want to construct an
admissible relaxation that satisfies (10). Here we show that the relaxation can be obtained as an upper bound on
a quantity called the conditional sequential Rademacher complexity, which is defined in [27] as follows. Let $\varepsilon$
be a vector $(\varepsilon_1, \ldots, \varepsilon_T)$ of i.i.d. Rademacher random variables, i.e., $\Pr(\varepsilon_i = \pm 1) = 1/2$. For a given $x \in \mathsf{X}$, an
$\mathcal{H}_x$-valued tree $\mathbf{h}$ of depth $d$ is defined as a sequence $(\mathbf{h}_1, \ldots, \mathbf{h}_d)$ of mappings $\mathbf{h}_t : \{\pm 1\}^{t-1} \to \mathcal{H}_x$, where
$\mathcal{H}_x$ is the function class defined in Section 4.2. Then the conditional sequential Rademacher complexity
at state $x$ is defined as

$$
\mathcal{R}_{x,t}(h_x^t) = \sup_{\mathbf{h}} \mathbb{E}_{\varepsilon_{t+1:T}} \max_{u \in \mathsf{U}} \Big[ 2 \sum_{s=t+1}^{T} \varepsilon_s \big[ \mathbf{h}_{s-t}(\varepsilon_{t+1:s-1}) \big](u) - \sum_{s=1}^{t} h_{x,s}(u) \Big], \qquad \forall h_x^t \in \mathcal{H}_x^t.
$$



Here the supremum is taken over all $\mathcal{H}_x$-valued binary trees of depth $T - t$. The term containing the
tree $\mathbf{h}$ can be seen as the “future”, while the term being subtracted off can be seen as the “past”. This quantity is
conditioned on the already observed $h_x^t$, while for the future we consider the worst possible binary tree.
As shown by [27], this Rademacher complexity is itself an admissible relaxation for standard (state-free)
online optimization problems; moreover, one can obtain other relaxations by further upper-bounding
the Rademacher complexity. As we will now show, because the action space $\mathsf{U}$ is finite and the functions
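in $\mathcal{H}_x$ are uniformly bounded (Assumption 1), the following upper bound on $\mathcal{R}_{x,t}(\cdot)$ is an admissible
relaxation, i.e., it satisfies condition (10):

$$
W_{x,t}(h_x^t) = \rho \log\Big( \sum_{u \in \mathsf{U}} \exp\Big( -\frac{1}{\rho} \sum_{s=1}^{t} h_{x,s}(u) \Big) \Big) + \frac{2}{\rho} L^2 (T - t), \tag{12}
$$

where $L = L(\mathsf{X}, \mathsf{U}, \mathcal{F})$ is the constant from Assumption 1 and the learning rate $\rho > 0$ can be tuned to optimize
the resulting regret bound. This relaxation leads to an algorithm that turns out to be exactly the scheme proposed by [9]: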


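Proposition 3 The relaxation (12) is admissible and it leads to an exponential weights algorithm,
specified recursively as follows: for all $x \in \mathsf{X}$, $u \in \mathsf{U}$,

$$
P_{t+1}(u|x) = \frac{P_t(u|x) \exp\big( -\frac{1}{\rho} h_{x,t}(u) \big)}{\big\langle P_t(\cdot|x), \exp\big( -\frac{1}{\rho} h_{x,t} \big) \big\rangle}
= \frac{\nu_1(u) \exp\big( -\frac{1}{\rho} \sum_{s=1}^{t} h_{x,s}(u) \big)}{\big\langle \nu_1, \exp\big( -\frac{1}{\rho} \sum_{s=1}^{t} h_{x,s} \big) \big\rangle}, \qquad t = 0, \ldots, T-1 \tag{13}
$$

where $\nu_1$ is the uniform distribution on $\mathsf{U}$.
Proof: See Appendix E. ■
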
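
The two forms of the exponential weights update in (13), the one-step recursion and the closed form in terms of cumulative losses, can be compared numerically. The sketch below assumes the convention that weights are multiplied by `exp(-h / rho)`, with `rho` playing the role of the tunable rate; the function names and the loss values are hypothetical.

```python
import math

def exp_weights_step(p, h, rho):
    """Recursive form: p_next(u) is proportional to p(u) * exp(-h(u) / rho)."""
    w = [pu * math.exp(-hu / rho) for pu, hu in zip(p, h)]
    z = sum(w)
    return [wu / z for wu in w]

def exp_weights_closed_form(history, rho, n_actions):
    """Closed form: uniform prior reweighted by exp(-cumulative loss / rho)."""
    totals = [sum(h[u] for h in history) for u in range(n_actions)]
    w = [math.exp(-s / rho) / n_actions for s in totals]
    z = sum(w)
    return [wu / z for wu in w]

# Hypothetical losses h_{x,1}, h_{x,2}, h_{x,3} at a single state x, three actions.
history = [[1.0, 0.0, 0.5], [0.2, 0.9, 0.1], [0.7, 0.3, 0.4]]
rho = 2.0

p = [1.0 / 3] * 3  # start from the uniform distribution
for h in history:
    p = exp_weights_step(p, h, rho)

p_closed = exp_weights_closed_form(history, rho, n_actions=3)
```

The recursive form is what one would typically run online, since it only needs the latest loss vector; the closed form makes explicit that the weights depend on the past only through cumulative losses.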
The above algorithm works with any collection of Q-functions satisfying the reverse Poisson inequalities
determined by the $f_t$’s (recall Assumption 1). Here is one particular example of such a function: the
usual Q-function that arises in reinforcement learning and that was used by [9]. Recall our assumption
that every stochastic state feedback law $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$ has a unique stationary distribution $\pi_P$. For given
choices of $P \in \mathcal{M}(\mathsf{U}|\mathsf{X})$ and $f \in \mathcal{F}$, consider the function

$$
Q_{P,f}(x, u) = \lim_{T \to \infty} \mathbb{E}_P\Big[ \sum_{t=1}^{T} \big( f(X_t, U_t) - \langle \pi_P \otimes P, f \rangle \big) \,\Big|\, X_1 = x,\ U_1 = u \Big],
$$



where $X_t$ and $U_t$ are the state and action at time step $t$ after starting from the initial state $X_1 = x$, applying
the immediate action $U_1 = u$, and following $P$ onwards. It is easy to check that $Q_{P,f}(x, u)$ satisfies the
reverse Poisson inequality for $P$ with forcing function $f$. In fact, it satisfies (9) with equality. We can also
derive a bound on the Q-function as a function of the mixing time $\tau$. Let us first bound $Q_{P,f}(x, P)$, where
$P$ is used on the first step instead of $u$. For all $t$, let $\mu_{x,t}^{P}$ be the state distribution at time $t$ starting from $x$
and following $P$ onwards. So we have

$$
Q_{P,f}(x, P) = \lim_{T \to \infty} \sum_{t=1}^{T} \Big[ \langle \mu_{x,t}^{P} \otimes P, f \rangle - \langle \pi_P \otimes P, f \rangle \Big]
\le \| f \|_\infty \sum_{t=1}^{\infty} \| \mu_{x,t}^{P} - \pi_P \|_1
\le 2 \| f \|_\infty \sum_{t=1}^{\infty} e^{-t/\tau}
\le 2 \tau \| f \|_\infty,
$$

where the bound on $\| \mu_{x,t}^{P} - \pi_P \|_1$ results from repeated application of the uniform mixing bound (7). Due to the
fact that the one-step cost is bounded by $C_{\mathcal{F}} = \sup_{f \in \mathcal{F}} \| f \|_\infty$, we have

$$
Q_{P,f}(x, u) \le Q_{P,f}(x, P) + f(x, u) - \langle \pi_P \otimes P, f \rangle \le 2 \tau C_{\mathcal{F}} + C_{\mathcal{F}} \le 3 \tau C_{\mathcal{F}}.
$$
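
The claim that this Q-function satisfies (9) with equality can be checked numerically on a toy example. The sketch below is an illustration under stated assumptions: it approximates the limit in the definition by a truncated sum over a finite horizon, propagating state distributions instead of sampling trajectories, and uses a hypothetical kernel `K`, cost `f`, and feedback law `P`.

```python
# Hypothetical transition law K[x][u][y], one-step cost f[x][u], and law P[x][u].
K = [
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.7, 0.3], [0.4, 0.6]],
]
f = [[1.0, 0.0], [0.0, 2.0]]
P = [[1.0, 0.0], [0.0, 1.0]]

def induced_kernel(K, P):
    n, m = len(K), len(K[0])
    return [[sum(P[x][u] * K[x][u][y] for u in range(m)) for y in range(n)]
            for x in range(n)]

def stationary(M, iters=2000):
    n = len(M)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[x] * M[x][y] for x in range(n)) for y in range(n)]
    return pi

def q_truncated(K, P, f, horizon=200):
    """Truncated version of the limit defining Q_{P,f}(x,u)."""
    n, m = len(K), len(K[0])
    M = induced_kernel(K, P)
    pi = stationary(M)
    f_P = [sum(P[x][u] * f[x][u] for u in range(m)) for x in range(n)]
    eta = sum(pi[x] * f_P[x] for x in range(n))  # steady-state cost
    Q = [[0.0] * m for _ in range(n)]
    for x in range(n):
        for u in range(m):
            total = f[x][u] - eta          # term t = 1: X_1 = x, U_1 = u
            mu = list(K[x][u])             # distribution of X_2
            for _ in range(horizon - 1):   # terms t = 2, ..., horizon under P
                total += sum(mu[y] * f_P[y] for y in range(n)) - eta
                mu = [sum(mu[y] * M[y][z] for y in range(n)) for z in range(n)]
            Q[x][u] = total
    return Q, eta

def poisson_residual(K, P, f, Q, eta):
    """Largest violation of E[Q(Y,P) | x,u] - Q(x,u) = -f(x,u) + eta."""
    n, m = len(K), len(K[0])
    Q_P = [sum(P[y][u] * Q[y][u] for u in range(m)) for y in range(n)]
    return max(
        abs(sum(K[x][u][y] * Q_P[y] for y in range(n)) - Q[x][u] + f[x][u] - eta)
        for x in range(n) for u in range(m)
    )

Q, eta = q_truncated(K, P, f)
residual = poisson_residual(K, P, f, Q, eta)
```

The truncation error decays geometrically with the horizon for a uniformly mixing chain, which is why a modest horizon suffices here.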

We can now establish the following regret bound for the exponential weights strategy (13):

Theorem 2 Let $L = L(\mathsf{X}, \mathsf{U}, \mathcal{F})$. Assume the state transition dynamics have a unichain structure. Then for
the relaxation (12) and the corresponding behavioral strategy $\widehat{\gamma}$ given by (13) with an appropriate choice
of $\rho$, we have


fr.f 


t=i 


<  2L\/2Tlog|U|  +  t 


/logiuir 


+  2tCjt. 


Proof:  See  Appendix  F.  ■ 

As we can see, this regret bound is consistent with the bound derived in [9]. Therefore, we have shown
that our framework, with a specific choice of relaxation, can recover their algorithm. The advantage
of our general framework is that we can obtain the corresponding regret bound simply by
instantiating our analysis on specific relaxations, without the need for the ad hoc proof techniques applied
in [9].


5.2  A  novel  lazy  RWM  algorithm  for  online  MDPs 

In the preceding section, we used our framework to recover a particular policy for an online MDP that
relies on exponential weight updates. In this section, we derive a “lazy” version of that policy, which can
be interpreted as a lazy RWM algorithm. The starting point is to divide time into phases of increasing
length, so that during each phase the agent applies a fixed state feedback law. The main advantage
of lazy strategies is their computational efficiency, which results from a looser relaxation and therefore
comes with suboptimal scaling of the regret with the time horizon.

We partition the set of time indices $1, 2, \ldots$ into nonoverlapping contiguous phases of (possibly)
increasing duration. The phases are indexed by $m \in \mathbb{N}$, where we denote the $m$th phase by $\mathcal{T}_m$ and its
duration by $\tau_m$. We also define $\mathcal{T}_{1:m} = \mathcal{T}_1 \cup \ldots \cup \mathcal{T}_m$ (the union of phases 1 through $m$) and denote its
duration by $\tau_{1:m}$. Let $M < T$ denote the number of complete phases concluded before time $T$. We now
describe a generic algorithm that works in phases:


Initialize at t = 0 and fix phases $\mathcal{T}_1, \ldots, \mathcal{T}_M$ s.t. $\tau_{1:M} = T$
for $t \in \mathcal{T}_1$, choose $u_t$ uniformly at random over $\mathsf{U}$
for m = 2, 3, ...
    for $t \in \mathcal{T}_m$ do
        if the process is at state $x_t$, choose action $u_t$ randomly according to $P_m(\cdot|x_t)$,
        where $P_m(u|x)$ is the state feedback law that uses only information from phases 1 to m - 1
    end for
end for
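The phase structure of the generic algorithm can be sketched in code. The following Python sketch is an illustration only, under stated assumptions: the function name run_lazy_phases and the callbacks policy_for_phase and step are hypothetical and are not part of the original text. The sketch shows that the feedback law is computed once at the start of a phase, from data recorded in earlier phases only, and is then frozen for the whole phase.

```python
import random

def run_lazy_phases(phase_lengths, actions, policy_for_phase, step, x0, seed=0):
    """Run a generic phase-based (lazy) strategy.

    phase_lengths    : durations tau_1, ..., tau_M of the phases
    actions          : list of actions (the finite set U)
    policy_for_phase : hypothetical callback (m, history) -> {state: weights over actions};
                       it only ever sees data from phases 1..m-1
    step             : hypothetical callback (x, u, rng) -> next state
    """
    rng = random.Random(seed)
    x, history, trajectory = x0, [], []
    for m, tau_m in enumerate(phase_lengths, start=1):
        # The state feedback law is computed once and kept fixed during phase m.
        law = None if m == 1 else policy_for_phase(m, list(history))
        phase_record = []
        for _ in range(tau_m):
            if law is None:
                u = rng.choice(actions)  # phase 1: uniform over U
            else:
                u = rng.choices(actions, weights=law[x], k=1)[0]
            phase_record.append((m, x, u))
            x = step(x, u, rng)
        # Data from phase m becomes available only after the phase has ended.
        history.extend(phase_record)
        trajectory.extend(phase_record)
    return trajectory
```

Because the law is recomputed only M times rather than T times, the per-step computational effort shrinks as the phases get longer.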


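Because in this section we work in phases instead of time steps, we need to provide an alternative definition of relaxations and of the admissibility condition. For every state $x \in \mathsf{X}$, we denote by $h_x^m$ the $\tau_m$-tuple $(h_{x,s} : s \in \mathcal{T}_m)$, and by $h_{x,1:m}$ the $\tau_{1:m}$-tuple $(h_{x,1}, h_{x,2}, \ldots, h_{x,\tau_{1:m}})$. For each $x \in \mathsf{X}$, we will say that a sequence of functions $\widehat{W}_{x,m} : \mathcal{H}_x^{\tau_{1:m}} \to \mathbb{R}$, $m = 1, \ldots, M$, is an admissible relaxation if

\[
\widehat{W}_{x,M}(h_{x,1:M}) \ge -\inf_{\nu \in \mathcal{P}(\mathsf{U})} \mathbb{E}_{U \sim \nu}\left[\sum_{m=1}^{M} \sum_{s \in \mathcal{T}_m} h_{x,s}(U)\right]
\]

and

\[
\widehat{W}_{x,m}(h_{x,1:m}) \ge \inf_{\nu \in \mathcal{P}(\mathsf{U})} \sup_{h_x^{m+1} \in \mathcal{H}_x^{\tau_{m+1}}} \left\{ \mathbb{E}_{U \sim \nu}\left[\sum_{s \in \mathcal{T}_{m+1}} h_{x,s}(U)\right] + \widehat{W}_{x,m+1}\left(h_{x,1:m}, h_x^{m+1}\right) \right\}, \qquad m = M-1, \ldots, 1.
\]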


For  a  given  state  x,  we  also  define  the  conditional  sequential  Rademacher  complexity  in  terms  of  phases: 


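\[
\mathcal{R}_{x,m}(h_{x,1:m}) = \sup_{\mathbf{h}} \mathbb{E}_{\varepsilon_{m+1:M}} \max_{u \in \mathsf{U}} \left[ 2 \sum_{j=m+1}^{M} \varepsilon_j \sum_{t \in \mathcal{T}_j} [\mathbf{h}_{x,t}(\varepsilon)](u) - \sum_{i=1}^{m} \sum_{s \in \mathcal{T}_i} h_{x,s}(u) \right].
\]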


Here the supremum is taken over all $\mathcal{H}_x$-valued binary trees of depth $M - m$. In the preceding section, we replaced the actual future induced by the infimum and supremum pairs in the conditional value by the "worst future" binary tree, which involves expectation over a sequence of coin flips in every time step. By contrast, in the above quantity we replace the real future by the "worst future" binary tree that branches only once per phase. Now we can construct the following relaxation:


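\[
\widehat{W}_{x,m}(h_{x,1:m}) = \rho \log\left( \sum_{u \in \mathsf{U}} \exp\left( -\frac{1}{\rho} \sum_{j=1}^{m} \sum_{s \in \mathcal{T}_j} h_{x,s}(u) \right) \right) + \frac{2 L(\mathsf{X}, \mathsf{U}, \mathcal{F})^2}{\rho} \sum_{j=m+1}^{M} \tau_j^2. \tag{14}
\]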


The  corresponding  algorithm,  specified  in  (15)  below,  uses  a  fixed  state  feedback  law  throughout  each 
phase: 


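Proposition 4 The relaxation (14) is admissible and it leads to the following Markov policy for phase $m$:

\[
P_m(u|x) = \frac{\nu_1(u) \exp\left( -\frac{1}{\rho} \sum_{j=1}^{m-1} \sum_{s \in \mathcal{T}_j} h_{x,s}(u) \right)}{\left\langle \nu_1, \exp\left( -\frac{1}{\rho} \sum_{j=1}^{m-1} \sum_{s \in \mathcal{T}_j} h_{x,s} \right) \right\rangle}, \tag{15}
\]

where $\nu_1$ is the uniform distribution on $\mathsf{U}$.
Proof: See Appendix G. ■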
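To make the phase policy concrete, here is a minimal Python sketch of an exponential-weights distribution over actions at a fixed state. It assumes the policy is proportional to exp(-(1/rho) * c(u)), where c(u) is the cumulative value of h_{x,s}(u) over all completed phases and the prior is uniform (so it cancels in the normalization). The function name lazy_exp_weights_policy and the argument names are hypothetical.

```python
import math

def lazy_exp_weights_policy(cum_h, rho):
    """Exponential-weights distribution over actions at one fixed state x.

    cum_h : list with cum_h[u] = sum of h_{x,s}(u) over all completed phases
    rho   : positive learning-rate parameter
    """
    # Shifting by the minimum leaves the distribution unchanged
    # and avoids overflow in exp for large cumulative values.
    base = min(cum_h)
    weights = [math.exp(-(c - base) / rho) for c in cum_h]
    total = sum(weights)
    return [w / total for w in weights]
```

With no data (all cumulative values zero) the sketch returns the uniform distribution, which agrees with the uniform choice of actions in the first phase. Since the cumulative values only change at phase boundaries, the distribution needs to be recomputed once per phase.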

Now  we  derive  the  regret  bound  for  (15): 

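Theorem 3 Let $L = L(\mathsf{X}, \mathsf{U}, \mathcal{F})$. Under the same assumptions as before, the behavioral strategy $\hat{\gamma}$ corresponding to (15) enjoys the following regret bound:

\[
\mathbb{E}_x^{\hat{\gamma},f}\left[\sum_{t=1}^{T} f_t(X_t,U_t) - \Psi(f)\right]
\le 2L\sqrt{2 \log|\mathsf{U}| \sum_{j=1}^{M} \tau_j^2} + \frac{2 C_{\mathcal{F}} M}{1 - e^{-1/\tau}}. \tag{16}
\]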


Proof:  See  Appendix  H.  ■ 

Our behavioral strategy (13) is a mixed strategy for choosing actions, and is essentially the randomized weighted majority (RWM) algorithm developed in online learning. As a result, our recursive algorithm (13) can be seen as an FRL algorithm. We add randomness automatically when choosing actions by using a mixed strategy; effectively, we are using an FPL algorithm to choose actions. By presenting a "lazy" version of that recursive algorithm, we derive a novel lazy FPL algorithm which is similar in spirit to the algorithm in [33].

[33]  also  consider  a  similar  model  where  the  decision-maker  has  full  knowledge  of  the  transition 
kernel  and  the  costs  are  chosen  by  an  adversary.  They  propose  an  algorithm  for  MDPs  with  arbitrarily 
changing  costs  that  achieves  sublinear  regret  based  on  the  oblivious  opponent  assumption.  Their  al¬ 
gorithm  computes  and  changes  policy  periodically  according  to  a  perturbed  version  of  the  empirically 
observed  cost  functions,  and  follows  the  computed  stationary  policy  for  increasingly  long  time  intervals. 
As  a  result,  their  algorithm  has  diminishing  computational  effort  per  time  step  and  is  computationally 
more  efficient  than  that  of  [9] .  [33]  call  their  algorithm  the  lazy  FPL  algorithm. 

Although our new algorithm is similar in nature to the algorithm presented in [33], our method has several advantages. First, in the algorithm of [33], the policy computation at the beginning of each phase requires solving a linear program and then adding a carefully tuned random perturbation to the solution. As a result, the performance analysis in [33] is rather lengthy and technical (in particular, it invokes several advanced results from perturbation theory for linear programs). By contrast, we can analyze the corresponding part of the regret bound simply by instantiating our analysis on specific relaxations, and we do not need to add any extra randomization, which renders our proof much less technical. Second, the regret bound of Theorem 3 shows that we can control the scaling of the regret with $T$ by choosing the duration of each phase. By contrast, the algorithm of [33] relies on a specific choice of phase durations in order to guarantee that the regret is sublinear in $T$ and scales as $O(T^{3/4})$. We show that if the horizon $T$ is known in advance, then it is possible to choose the phase durations to secure $O(T^{2/3})$ regret, which is better than the $O(T^{3/4})$ bound derived by [33].

Corollary 1 For a given horizon $T$, the optimal choice of phase lengths is $T^{1/3}$, which gives the regret of $O(T^{2/3})$.

Proof:  See  Appendix  I.  ■ 
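The effect of the phase schedule on the regret bound can be explored numerically. The Python sketch below assumes that the right-hand side of (16) has the form 2L*sqrt(2*log|U|*sum_j tau_j^2) + 2*C_F*M/(1 - exp(-1/tau)), where tau is the mixing time; the function names lazy_regret_bound and equal_phases, and all numerical values used with them, are hypothetical illustrations rather than part of the original analysis.

```python
import math

def lazy_regret_bound(phase_lengths, L, n_actions, C_F, tau_mix):
    """Evaluate the assumed form of the bound (16) for a given phase schedule."""
    sum_sq = sum(t * t for t in phase_lengths)
    M = len(phase_lengths)
    # Term coming from the relaxation at time zero (learning term).
    learning_term = 2.0 * L * math.sqrt(2.0 * math.log(n_actions) * sum_sq)
    # Term paid for re-mixing after each of the M policy switches.
    switching_term = 2.0 * C_F * M / (1.0 - math.exp(-1.0 / tau_mix))
    return learning_term + switching_term

def equal_phases(T, length):
    """Split the horizon T into phases of the given length (last one may be shorter)."""
    full, rem = divmod(T, length)
    return [length] * full + ([rem] if rem else [])
```

For example, with T = 10**6 one can compare equal_phases(T, 100), i.e. phases of length T^(1/3), against equal_phases(T, 1), i.e. a policy switch at every step. The first schedule balances the two terms, while the second makes the switching term linear in T.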


6  Conclusions 

We provide a unified viewpoint on the design and the analysis of online MDP algorithms, which is an extension of the general relaxation-based approach of [26] to a certain class of stochastic game models. We showed that an algorithm previously proposed by [9] naturally arises from our framework via a specific relaxation. Moreover, we showed that one can obtain lazy strategies (where time is split into phases, and a different stationary policy is followed in each phase) by means of relaxations as well. In particular, we have obtained a new strategy, which is similar in spirit to the one previously proposed by [33], but with several advantages, including better scaling of the regret.


A  Proof  of  Proposition  1 


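The agent's closed-loop behavioral strategy $\gamma$ is a tuple of mappings $\gamma_t : \mathsf{X} \times \mathcal{F}^{t-1} \to \mathcal{P}(\mathsf{U})$, $1 \le t \le T$; the environment's open-loop behavioral strategy $f$ is a tuple of functions $(f_1, \ldots, f_T)$ with each $f_t \in \mathcal{F}$. Thus,

\begin{align*}
V(x) &= \inf_{\gamma} \sup_{f} \mathbb{E}_x^{\gamma,f}\left[\sum_{t=1}^{T} f_t(X_t,U_t) - \Psi(f)\right] \\
&= \inf_{\gamma_1} \ldots \inf_{\gamma_T} \sup_{f_1} \ldots \sup_{f_T} \mathbb{E}_x^{\gamma_1,\ldots,\gamma_T,f_1,\ldots,f_T}\left[\sum_{t=1}^{T} f_t(X_t,U_t) - \Psi(f)\right].
\end{align*}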


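We start from the final step $T$ and proceed by backward induction. Assuming $\gamma_1, \ldots, \gamma_{T-1}$ were already chosen, and writing $\gamma^{T-1} = (\gamma_1, \ldots, \gamma_{T-1})$ and $f^{T-1} = (f_1, \ldots, f_{T-1})$, we have

\begin{align*}
&\inf_{\gamma_T} \sup_{f_1,\ldots,f_T} \mathbb{E}_x^{\gamma,f}\left[\sum_{t=1}^{T-1} f_t(X_t,U_t) + f_T(X_T,U_T) - \Psi(f)\right] \\
&\quad= \inf_{\gamma_T} \sup_{f_1,\ldots,f_{T-1}} \sup_{f_T} \left\{ \mathbb{E}_x^{\gamma^{T-1},f^{T-1}}\left[\sum_{t=1}^{T-1} f_t(X_t,U_t)\right] + \mathbb{E}_x^{\gamma,f}\left[f_T(X_T,U_T) - \Psi(f)\right] \right\} \\
&\quad= \inf_{\gamma_T} \sup_{f_1,\ldots,f_{T-1}} \left\{ \mathbb{E}_x^{\gamma^{T-1},f^{T-1}}\left[\sum_{t=1}^{T-1} f_t(X_t,U_t)\right] + \sup_{f_T} \mathbb{E}_x^{\gamma,f}\left[f_T(X_T,U_T) - \Psi(f)\right] \right\} \\
&\quad= \sup_{f_1,\ldots,f_{T-1}} \left\{ \mathbb{E}_x^{\gamma^{T-1},f^{T-1}}\left[\sum_{t=1}^{T-1} f_t(X_t,U_t)\right] + \inf_{P_T(u_T|x_T)} \sup_{f_T} \mathbb{E}_x^{\gamma,f}\left[f_T(X_T,U_T) - \Psi(f)\right] \right\}.
\end{align*}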


The last step is due to the easily proved fact that, for any two sets $A$, $B$ and bounded functions $g_1 : A \to \mathbb{R}$, $g_2 : A \times B \to \mathbb{R}$,

\[
\inf_{\gamma : A \to B} \sup_{a \in A} \left\{ g_1(a) + g_2(a, \gamma(a)) \right\} = \sup_{a \in A} \left\{ g_1(a) + \inf_{b \in B} g_2(a, b) \right\}
\]

(see, e.g., Lemma 1.6.1 in [5]). Proceeding inductively in this way, we get (4).
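The interchange of the infimum over maps and the supremum over points is easy to sanity-check on finite sets. The Python sketch below enumerates every map from A to B by brute force and compares the two sides of the identity; the function names inf_sup_over_maps and sup_of_inf, and the dictionary encoding of g1 and g2, are hypothetical choices made only for this illustration.

```python
from itertools import product

def inf_sup_over_maps(A, B, g1, g2):
    """Left-hand side: inf over all maps gamma: A -> B of sup_a {g1(a) + g2(a, gamma(a))}."""
    best = float("inf")
    for choice in product(B, repeat=len(A)):
        gamma = dict(zip(A, choice))  # one particular map gamma
        worst_case = max(g1[a] + g2[(a, gamma[a])] for a in A)
        best = min(best, worst_case)
    return best

def sup_of_inf(A, B, g1, g2):
    """Right-hand side: sup_a {g1(a) + inf_b g2(a, b)}."""
    return max(g1[a] + min(g2[(a, b)] for b in B) for a in A)
```

The identity holds because the map gamma may pick the best response b separately for each a, which is exactly what a closed-loop strategy does when it reacts to the observed history.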


B Proof of Proposition 2

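The proof is by backward induction. Starting at time $T$ and using the admissibility condition (6), we write

\begin{align*}
\mathbb{E}_x^{\hat{\gamma},f}\left[\sum_{t=1}^{T} f_t(X_t,U_t) - \Psi(f)\right]
&\le \mathbb{E}_x^{\hat{\gamma},f}\left[\sum_{t=1}^{T} f_t(X_t,U_t) + \widehat{V}_T(X_{T+1}, f^T)\right] \\
&= \mathbb{E}_x^{\hat{\gamma},f}\left[\sum_{t=1}^{T-1} f_t(X_t,U_t)\right] + \mathbb{E}_x^{\hat{\gamma},f}\left[f_T(X_T,U_T) + \widehat{V}_T(X_{T+1}, f^T)\right] \\
&= \mathbb{E}_x^{\hat{\gamma},f}\left[\sum_{t=1}^{T-1} f_t(X_t,U_t)\right] + \sum_{x_T} \mu_T(x_T) \left\{ \sum_{u \in \mathsf{U}} f_T(x_T,u) \left[\hat{\gamma}_T(x_T, f^{T-1})\right](u) + \mathbb{E}\left[\widehat{V}_T(X_{T+1}, f^T) \,\middle|\, x_T, \hat{\gamma}_T(x_T, f^{T-1})\right] \right\} \\
&\le \mathbb{E}_x^{\hat{\gamma},f}\left[\sum_{t=1}^{T-1} f_t(X_t,U_t) + \widehat{V}_{T-1}(X_T, f^{T-1})\right],
\end{align*}

where $\mu_T \in \mathcal{P}(\mathsf{X})$ denotes the probability distribution of $X_T$. The last inequality is due to the fact that $\hat{\gamma}$ is the behavioral strategy associated to the admissible relaxation $\{\widehat{V}_t\}_{t=0}^{T}$. Continuing in this manner, we complete the proof.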


C  Proof  of  Lemma  1 


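Let us take expectations of both sides of (9) w.r.t. $\pi_{P'} \otimes P'$:

\begin{align*}
\langle \pi_P \otimes P, g \rangle - \langle \pi_{P'} \otimes P', g \rangle
&\le \mathbb{E}_{\pi_{P'} \otimes P'}\left\{ \mathbb{E}[Q(Y,P)|X,U] - Q(X,U) \right\} \\
&= \sum_{x,u} \pi_{P'}(x) P'(u|x) \left\{ \mathbb{E}[Q(Y,P)|x,u] - Q(x,u) \right\} \\
&= \sum_{x,u} \pi_{P'}(x) P'(u|x) \mathbb{E}[Q(Y,P)|x,u] - \sum_{x,u} \left( \sum_{y} \pi_{P'}(y) K(x|y,P') \right) P'(u|x) Q(x,u),
\end{align*}

where in the third step we have used the fact that $\pi_{P'}$ is invariant w.r.t. $K(\cdot|\cdot,P')$. Then we have

\begin{align*}
&\sum_{x,u} \pi_{P'}(x) P'(u|x) \mathbb{E}[Q(Y,P)|x,u] - \sum_{x,u} \sum_{y} \pi_{P'}(y) K(x|y,P') P'(u|x) Q(x,u) \\
&\quad= \sum_{x} \pi_{P'}(x) \left\{ \sum_{u} P'(u|x) \mathbb{E}[Q(Y,P)|x,u] - \sum_{u,y} K(y|x,P') P'(u|y) Q(y,u) \right\} \\
&\quad= \sum_{x} \pi_{P'}(x) \left\{ \sum_{u} P'(u|x) \mathbb{E}[Q(Y,P)|x,u] - \sum_{y} K(y|x,P') Q(y,P') \right\},
\end{align*}

where the second step is by definition of $Q(y,P')$. Then we can write

\begin{align*}
&\sum_{x} \pi_{P'}(x) \left\{ \sum_{u} P'(u|x) \mathbb{E}[Q(Y,P)|x,u] - \sum_{y} K(y|x,P') Q(y,P') \right\} \\
&\quad\overset{(a)}{=} \sum_{x} \pi_{P'}(x) \left\{ \sum_{u} P'(u|x) \left( \mathbb{E}[Q(Y,P)|x,u] - \sum_{y} K(y|x,u) Q(y,P') \right) \right\} \\
&\quad= \sum_{x,u} \pi_{P'}(x) P'(u|x) \left\{ \mathbb{E}[Q(Y,P)|x,u] - \mathbb{E}[Q(Y,P')|x,u] \right\} \\
&\quad\overset{(b)}{=} \sum_{x,u,y} \pi_{P'}(x) P'(u|x) K(y|x,u) \left\{ \sum_{u'} P(u'|y) Q(y,u') - \sum_{u'} P'(u'|y) Q(y,u') \right\} \\
&\quad\overset{(c)}{=} \sum_{x,y} \pi_{P'}(x) K(y|x,P') \left\{ \sum_{u'} P(u'|y) Q(y,u') - \sum_{u'} P'(u'|y) Q(y,u') \right\} \\
&\quad\overset{(d)}{=} \sum_{x} \pi_{P'}(x) \sum_{u} \left[ P(u|x) Q(x,u) - P'(u|x) Q(x,u) \right],
\end{align*}

where (a) and (c) are by definition of $K(\cdot|\cdot,P')$; (b) is by definition of $Q(y,P')$; and in (d) we use the fact that $\pi_{P'}$ is invariant w.r.t. $K(\cdot|\cdot,P')$.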


D  Proof  of  Proposition 


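We have

\begin{align*}
\mathbb{E}_x^{\hat{\gamma},f}\left[\sum_{t=1}^{T} f_t(X_t,U_t) - \Psi(f)\right]
&\le \sup_{P \in \mathcal{M}(\mathsf{U}|\mathsf{X})} \sum_{t=1}^{T} \left\{ \langle \pi_t^{\hat{\gamma},f} \otimes P_t^{\hat{\gamma},f}, f_t \rangle - \langle \pi_P \otimes P, f_t \rangle \right\} + \sum_{t=1}^{T} \|f_t\|_\infty \|\mu_t^{\hat{\gamma},f} - \pi_t^{\hat{\gamma},f}\|_1 \\
&\le \sup_{P \in \mathcal{M}(\mathsf{U}|\mathsf{X})} \sum_{x} \pi_P(x) \sum_{t=1}^{T} \sum_{u} \left\{ P_t^{\hat{\gamma},f}(u|x) Q_t^{\hat{\gamma},f}(x,u) - P(u|x) Q_t^{\hat{\gamma},f}(x,u) \right\} + \sum_{t=1}^{T} \|f_t\|_\infty \|\mu_t^{\hat{\gamma},f} - \pi_t^{\hat{\gamma},f}\|_1,
\end{align*}

where in the first step we have used (8), while the second step is by Lemma 1. Then we write the last term (t = T) of the sum out and get

\begin{align*}
&\sup_{P \in \mathcal{M}(\mathsf{U}|\mathsf{X})} \sum_{x} \pi_P(x) \left\{ \sum_{t=1}^{T-1} \sum_{u} P_t^{\hat{\gamma},f}(u|x) Q_t^{\hat{\gamma},f}(x,u) + \sum_{u} P_T^{\hat{\gamma},f}(u|x) Q_T^{\hat{\gamma},f}(x,u) - \sum_{u} P(u|x) \sum_{t=1}^{T} Q_t^{\hat{\gamma},f}(x,u) \right\} + \sum_{t=1}^{T} \|f_t\|_\infty \|\mu_t^{\hat{\gamma},f} - \pi_t^{\hat{\gamma},f}\|_1 \\
&\quad\le \sup_{P \in \mathcal{M}(\mathsf{U}|\mathsf{X})} \sum_{x} \pi_P(x) \left\{ \sum_{t=1}^{T-1} \sum_{u} P_t^{\hat{\gamma},f}(u|x) Q_t^{\hat{\gamma},f}(x,u) + \sum_{u} P_T^{\hat{\gamma},f}(u|x) Q_T^{\hat{\gamma},f}(x,u) + \widehat{W}_{x,T}(h_x^T) \right\} + \sum_{t=1}^{T} \|f_t\|_\infty \|\mu_t^{\hat{\gamma},f} - \pi_t^{\hat{\gamma},f}\|_1 \\
&\quad\le \sup_{P \in \mathcal{M}(\mathsf{U}|\mathsf{X})} \sum_{x} \pi_P(x) \left\{ \sum_{t=1}^{T-1} \sum_{u} P_t^{\hat{\gamma},f}(u|x) Q_t^{\hat{\gamma},f}(x,u) + \widehat{W}_{x,T-1}(h_x^{T-1}) \right\} + \sum_{t=1}^{T} \|f_t\|_\infty \|\mu_t^{\hat{\gamma},f} - \pi_t^{\hat{\gamma},f}\|_1,
\end{align*}

where the two inequalities are by the admissibility condition (10). Continuing this induction backward, and noting that $\sum_{t=1}^{T} \|f_t\|_\infty \|\mu_t^{\hat{\gamma},f} - \pi_t^{\hat{\gamma},f}\|_1 \le C_{\mathcal{F}} \sum_{t=1}^{T} \|\mu_t^{\hat{\gamma},f} - \pi_t^{\hat{\gamma},f}\|_1$, we arrive at (11).


E Proof of Proposition 3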


First we show that the relaxation (12) arises as an upper bound on the conditional sequential Rademacher complexity. The proof of this is similar to the one given by [26], except that they also optimize over the choice of the learning rate $\rho$. For any $\rho > 0$,


where the first inequality is by Jensen's inequality, while the second inequality is due to the non-negativity of the exponential function. Then we pull out the second term inside the expectation $\mathbb{E}_\varepsilon$ and get


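\begin{align*}
&\rho \log\left( \sum_{u \in \mathsf{U}} \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t} h_{x,s}(u) \right) \mathbb{E}_\varepsilon\left[ \prod_{i=1}^{T-t} \exp\left( \frac{2}{\rho} \varepsilon_i [\mathbf{h}_i(\varepsilon)](u) \right) \right] \right) \\
&\quad\le \rho \log\left( \sum_{u \in \mathsf{U}} \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t} h_{x,s}(u) \right) \times \exp\left( \frac{2}{\rho^2} \max_{\varepsilon_1,\ldots,\varepsilon_{T-t} \in \{\pm 1\}} \sum_{i=1}^{T-t} \left( [\mathbf{h}_i(\varepsilon)](u) \right)^2 \right) \right) \\
&\quad\le \rho \log\left( \sum_{u \in \mathsf{U}} \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t} h_{x,s}(u) \right) \max_{u \in \mathsf{U}} \exp\left( \frac{2}{\rho^2} \max_{\varepsilon_1,\ldots,\varepsilon_{T-t} \in \{\pm 1\}} \sum_{i=1}^{T-t} \left( [\mathbf{h}_i(\varepsilon)](u) \right)^2 \right) \right) \\
&\quad\le \rho \log\left( \sum_{u \in \mathsf{U}} \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t} h_{x,s}(u) \right) \right) + \frac{2}{\rho} \sup_{\mathbf{h}} \max_{u \in \mathsf{U}} \max_{\varepsilon_1,\ldots,\varepsilon_{T-t} \in \{\pm 1\}} \sum_{i=1}^{T-t} \left( [\mathbf{h}_i(\varepsilon)](u) \right)^2,
\end{align*}

where the first inequality is due to Hoeffding's lemma (see, e.g., Lemma A.1 in [7]) applied to the expectation w.r.t. $\varepsilon$. The last term, representing the worst-case future, is upper bounded by $\frac{2}{\rho}(T-t)L(\mathsf{X}, \mathsf{U}, \mathcal{F})^2$. We thus obtain our exponential weight relaxation (12).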

Next we prove that the relaxation (12) is admissible and leads to the recursive algorithm (13). To keep the notation simple, we drop the subscript $x$ in the following. In particular, we use $h_t$ for $h_{x,t}$, $\widehat{W}_t$ for $\widehat{W}_{x,t}$, $\nu_t$ for $P_t(\cdot|x)$, etc. The admissibility condition to be proved is

\[
\sup_{h_t \in \mathcal{H}_x} \left\{ \mathbb{E}_{U \sim \nu_t}[h_t(U)] + \widehat{W}_t(h^t) \right\} \le \widehat{W}_{t-1}(h^{t-1}).
\]


Note that

We have

where the first equality is due to the fact that $\nu_1$ is the uniform distribution on $\mathsf{U}$, while the inequality is due to Hoeffding's lemma. Plugging the resulting bound into the admissibility condition, we get

\[
\sup_{h_t \in \mathcal{H}_x} \left\{ \mathbb{E}_{U \sim \nu_t}[h_t(U)] + \widehat{W}_t(h^t) \right\}
\le \rho \log\left( \sum_{u \in \mathsf{U}} \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-1} h_s(u) \right) \right) + \frac{2}{\rho}(T-t+1)L^2 = \widehat{W}_{t-1}(h^{t-1}).
\]

Thus, the recursive algorithm (13) is admissible for the relaxation (12).


F  Proof  of  Theorem  2 


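Again, we drop the subscript $x$ and write $\nu_t$ for $P_t(\cdot|x)$, etc. We have

\[
\mathbb{E}_x^{\hat{\gamma},f}\left[\sum_{t=1}^{T} f_t(X_t,U_t) - \Psi(f)\right]
\le \sup_{P \in \mathcal{M}(\mathsf{U}|\mathsf{X})} \sum_{x} \pi_P(x) \widehat{W}_{x,0} + C_{\mathcal{F}} \sum_{t=1}^{T} \|\mu_t^{\hat{\gamma},f} - \pi_t^{\hat{\gamma},f}\|_1. \tag{17}
\]

From the relaxation (12), it is easy to see that $\widehat{W}_{x,0} \le 2L\sqrt{2T\log|\mathsf{U}|}$ for all states $x$ (in fact, the bound is met with equality with the optimal choice $\rho = \sqrt{2TL^2/\log|\mathsf{U}|}$). Since we have bounded the first term, we now focus on bounding the second term of the regret bound.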
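The optimal value of $\rho$ can be checked by a short calculation. The worked steps below assume, as the surrounding text indicates, that the relaxation (12) evaluated at $t = 0$ reduces to $\rho\log|\mathsf{U}| + 2TL^2/\rho$ (the sum inside the logarithm is then over $|\mathsf{U}|$ terms all equal to one); this form is an assumption made for the illustration.

```latex
\begin{align*}
\widehat{W}_{x,0} &= \rho \log|\mathsf{U}| + \frac{2TL^2}{\rho}, \\
\frac{d}{d\rho} \widehat{W}_{x,0} &= \log|\mathsf{U}| - \frac{2TL^2}{\rho^2} = 0
  \quad\Longrightarrow\quad \rho^\ast = L\sqrt{\frac{2T}{\log|\mathsf{U}|}}, \\
\widehat{W}_{x,0}\Big|_{\rho = \rho^\ast} &= L\sqrt{2T\log|\mathsf{U}|} + L\sqrt{2T\log|\mathsf{U}|} = 2L\sqrt{2T\log|\mathsf{U}|}.
\end{align*}
```

At the optimum the two terms are equal, which is the usual balance between the cost of exploration and the accumulated worst-case future.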

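The relative entropy between $\nu_t$ and $\nu_{t-1}$ is given by

\begin{align*}
D(\nu_t \| \nu_{t-1})
&= \left\langle \nu_t, \log \frac{\exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-1} h_s \right)}{\exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-2} h_s \right)} \right\rangle + \log \frac{\left\langle \nu_1, \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-2} h_s \right) \right\rangle}{\left\langle \nu_1, \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-1} h_s \right) \right\rangle} \\
&= -\frac{1}{\rho} \langle \nu_t, h_{t-1} \rangle + \log \frac{\left\langle \nu_1, \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-2} h_s \right) \right\rangle}{\left\langle \nu_1, \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-1} h_s \right) \right\rangle}, \tag{18}
\end{align*}

where

\[
\frac{\left\langle \nu_1, \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-2} h_s \right) \right\rangle}{\left\langle \nu_1, \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-1} h_s \right) \right\rangle}
= \frac{\sum_{u \in \mathsf{U}} \nu_1(u) \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-1} h_s(u) \right) \exp\left( \frac{1}{\rho} h_{t-1}(u) \right)}{\left\langle \nu_1, \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-1} h_s \right) \right\rangle}
= \left\langle \nu_t, \exp\left( \frac{1}{\rho} h_{t-1} \right) \right\rangle.
\]

Using Hoeffding's lemma, we can write

\[
\log \frac{\left\langle \nu_1, \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-2} h_s \right) \right\rangle}{\left\langle \nu_1, \exp\left( -\frac{1}{\rho} \sum_{s=1}^{t-1} h_s \right) \right\rangle}
\le \frac{1}{\rho} \langle \nu_t, h_{t-1} \rangle + \frac{L^2}{2\rho^2}.
\]

Substituting this bound into (18), we see that the terms involving the expectation of $h_{t-1}$ w.r.t. $\nu_t$ cancel, and we are left with

\[
D(\nu_t \| \nu_{t-1}) \le \frac{L^2}{2\rho^2}.
\]

Plugging in the optimal value of $\rho$ and using Pinsker's inequality [8], we find

\[
\|\nu_t - \nu_{t-1}\|_1 \le \sqrt{\frac{\log|\mathsf{U}|}{2T}}.
\]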
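The slow drift of successive exponential-weights distributions can be checked numerically. The Python sketch below assumes that the distribution at each step is proportional to exp(-(1/rho) * cumulative cost), that every cost lies in [-L, L], and that rho = sqrt(2*T*L^2 / log|U|); under these assumptions the L1 distance between consecutive distributions should not exceed sqrt(log|U| / (2T)). The function names exp_weights and drift_check and the random cost sequence are hypothetical.

```python
import math
import random

def exp_weights(cum_cost, rho):
    """Distribution proportional to exp(-cum_cost[u] / rho)."""
    base = min(cum_cost)  # shift for numerical stability
    w = [math.exp(-(c - base) / rho) for c in cum_cost]
    z = sum(w)
    return [x / z for x in w]

def drift_check(n_actions, T, L, seed=0):
    """Return (largest observed L1 drift, claimed bound sqrt(log|U| / (2T)))."""
    rng = random.Random(seed)
    rho = math.sqrt(2.0 * T * L ** 2 / math.log(n_actions))
    cum = [0.0] * n_actions
    prev = exp_weights(cum, rho)
    worst = 0.0
    for _ in range(T):
        # One new cost vector with entries in [-L, L].
        h = [rng.uniform(-L, L) for _ in range(n_actions)]
        cum = [c + x for c, x in zip(cum, h)]
        cur = exp_weights(cum, rho)
        worst = max(worst, sum(abs(a - b) for a, b in zip(cur, prev)))
        prev = cur
    return worst, math.sqrt(math.log(n_actions) / (2.0 * T))
```

Calling drift_check(4, 200, 1.0) returns the largest drift seen along one random cost sequence together with the bound; a random sequence typically stays well below the bound, since the bound covers the worst case.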


So far, we have been working with a fixed state $x \in \mathsf{X}$, so we had $\nu_t = P_t^{\hat{\gamma},f}(\cdot|x)$, where $\hat{\gamma}$ is the agent's behavioral strategy induced by the relaxation (12). Since $x$ was arbitrary, we get the uniform bound

\[
\max_{x \in \mathsf{X}} \left\| P_t^{\hat{\gamma},f}(\cdot|x) - P_{t-1}^{\hat{\gamma},f}(\cdot|x) \right\|_1 \le \sqrt{\frac{\log|\mathsf{U}|}{2T}}. \tag{19}
\]


Armed with this estimate, we now bound the total variation distance between the actual state distribution at time $t$ and the unique invariant distribution of $K_t^{\hat{\gamma},f}$. For any time $k < t$, we have

(20)


where (a) is by triangle inequality; (b) is by invariance of $\pi_t^{\hat{\gamma},f}$ w.r.t. $K_t^{\hat{\gamma},f}$; (c) is by the uniform mixing bound (7); and (d) follows from repeatedly using (19) together with triangle inequality and the easily proved fact that, for any state distribution $\mu \in \mathcal{P}(\mathsf{X})$ and any two Markov kernels $P, P' \in \mathcal{M}(\mathsf{U}|\mathsf{X})$,

Letting now the initial state distribution be $\mu_1$, we can apply the bound (20) recursively to obtain

So, the second term on the right-hand side of (17) can be bounded by

\[
C_{\mathcal{F}} \sum_{t=1}^{T} \|\mu_t^{\hat{\gamma},f} - \pi_t^{\hat{\gamma},f}\|_1 \le C_{\mathcal{F}}\,\tau^2 \sqrt{\frac{T\log|\mathsf{U}|}{2}} + 2\tau C_{\mathcal{F}},
\]

which completes the proof.


G  Proof  of  Proposition  4 


First we show that the relaxation (14) arises as an upper bound on the conditional sequential Rademacher complexity. Once again, we omit the subscript $x$ from $h_{x,t}$ etc. to keep the notation light. Following the same steps as in Appendix E, we have, for any $\rho > 0$,

In the same vein,

where the first inequality is due to Hoeffding's lemma, while the last inequality is by Assumption 1. We thus derive the relaxation in (14).

Now we prove that this relaxation is admissible, and leads to the lazy algorithm (15). The admissibility condition to be proved is

where $\nu_m = P_m(\cdot|x)$ is the Markov policy used in phase $m$. We have

Plugging this into the admissibility condition, we have

So the lazy algorithm (15) is an admissible strategy for the relaxation (14).


H Proof of Theorem 3


The state feedback law $P_t^{\hat{\gamma},f}(\cdot|x)$ that the agent applies within phase $m$ is the same for all $t \in \mathcal{T}_m$, and we denote it by $P_m^{\hat{\gamma},f}(\cdot|x)$. Let $K_m^{\hat{\gamma},f}$ denote the Markov matrix that describes the state transition from $X_t$ to $X_{t+1}$ if $t \in \mathcal{T}_m$. Thus, we can write

\[
K_m^{\hat{\gamma},f}(y|x) = \sum_{u \in \mathsf{U}} K(y|x,u) P_m^{\hat{\gamma},f}(u|x), \qquad \forall x, y \in \mathsf{X}.
\]

First, we show that

where $\pi_m^{\hat{\gamma},f}$ is the invariant distribution of $K_m^{\hat{\gamma},f}$. To prove (21), we write

where the last inequality is by Lemma 1. By writing out the first term on the right-hand side, we get

The last inequality is due to the fact that $\hat{\gamma}$ is the behavioral strategy associated to the admissible relaxation $\{\widehat{W}_{x,m}\}_{m=1}^{M}$. Continuing this induction backwards, we arrive at (21).

Next, we bound the two terms on the right-hand side of (21). From the form of the relaxation (14), it is easy to see that $\widehat{W}_{x,0} \le 2L\sqrt{2\log|\mathsf{U}| \sum_{j=1}^{M} \tau_j^2}$ for all states $x$; in fact, this bound is attained with equality if we use the optimal choice $\rho = \sqrt{2 \sum_{j=1}^{M} \tau_j^2 L^2 / \log|\mathsf{U}|}$. Since we have bounded the first term, we now focus on bounding the second term of (21).

From the contraction inequality (7) it follows that, for every $k \in \{0, 1, \ldots, \tau_m - 1\}$, we have

Plugging it in (21), we have shown that


I Proof of Corollary 1

Let us inspect the right-hand side of (16). We see that both $\sqrt{\sum_{j=1}^{M} \tau_j^2}$ and $M$ have to be sublinear in $T$.

Since $\sum_{j=1}^{M} \tau_j = T$ and $\sqrt{\sum_{j=1}^{M} \tau_j^2} \le \sqrt{\left(\sum_{j=1}^{M} \tau_j\right)^2}$, at least the first of these terms can be made sublinear, e.g., by having $\tau_j = 1$ for all $j$. Of course, this means that $M = T$, so we need longer phases. For example, if we follow [33] and let $\tau_m = \lceil m^{1/3-\epsilon} \rceil$ for some $\epsilon \in (0, 1/3)$, then a straightforward if tedious algebraic calculation shows that $M = O(T^{3/4})$ and $\sqrt{\sum_{j=1}^{M} \tau_j^2} = O(T^{5/8})$, which yields the regret of $O(T^{3/4})$.

However, if $T$ is known in advance, then we can do better: ignoring the rounding issues, for any constants $A_1, A_2 > 0$,

To see this, let us first fix $M$ and optimize the choice of the $\tau_j$'s:


(22)

By the Cauchy-Schwarz inequality, we have

\[
T = \sum_{j=1}^{M} \tau_j \le \sqrt{M \sum_{j=1}^{M} \tau_j^2}.
\]

Thus, $\sum_{j=1}^{M} \tau_j^2$ achieves its minimum when the above bound is met with equality. This will happen only if all the $\tau_j$'s are equal, i.e., $\tau_j = T/M$ for every $j$ (for simplicity, we assume that $M$ divides $T$; otherwise, the remainder term will be strictly smaller than $M$, and the bound in (22) will still hold, but with a larger multiplicative constant). Therefore,

\[
\min_{M} \min_{\tau_1,\ldots,\tau_M :\, \sum_{j=1}^{M} \tau_j = T} \left\{ A_1 \sqrt{\sum_{j=1}^{M} \tau_j^2} + A_2 M \right\} = \min_{M} \left\{ \frac{A_1 T}{\sqrt{M}} + A_2 M \right\},
\]

where the minimum on the right-hand side (again, ignoring rounding issues) is achieved, up to a constant factor depending on $A_1$ and $A_2$, by $M = T^{2/3}$ and $\tau_j = T^{1/3}$ for all $j$. This shows that, for a given horizon $T$, the optimal choice of phase lengths is $T^{1/3}$, which gives the regret of $O(T^{2/3})$, better than the $O(T^{3/4})$ bound derived by [33].
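The last minimization can be verified numerically. The Python sketch below assumes the objective A1*T/sqrt(M) + A2*M obtained for equal phase lengths T/M; setting its derivative in M to zero gives the stationary point M* = (A1*T/(2*A2))^(2/3), which is proportional to T^(2/3). The function names objective, best_num_phases and stationary_point are hypothetical.

```python
import math

def objective(M, T, A1, A2):
    """Bound as a function of the number of phases M, for equal phase lengths T/M."""
    return A1 * T / math.sqrt(M) + A2 * M

def best_num_phases(T, A1, A2):
    """Brute-force minimizer over integer M in {1, ..., T}."""
    return min(range(1, T + 1), key=lambda M: objective(M, T, A1, A2))

def stationary_point(T, A1, A2):
    """Closed-form minimizer over real M > 0: (A1*T / (2*A2))^(2/3)."""
    return (A1 * T / (2.0 * A2)) ** (2.0 / 3.0)
```

At the stationary point the first term equals twice the second, so the minimum value is 3*A2*M*, which again scales as T^(2/3). Because the objective is convex in M, the brute-force integer minimizer lies within one unit of M*.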